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Outline 


•  Introduction: 

-  Denial  of  Service  -  The  Internet  Bottleneck  problem 

•  The  Architecture 

-  System  Architecture 

-  OpenIMP  platform 

-  DDoS  Detection  Metrics 

-  Detection  using  Latent  Semantic  Indexing  and  Clustering 

•  Conclusion: 

-  How  does  IPFIX  support  the  integration  of  new  metrics 
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The  DDoS  Problem 


DDoS flooding attacks saturate the final link(s).

Filters are only effective before the bandwidth becomes scarce.

Hence, the end user can hardly take effective measures.

[Figure: traffic from the attacker converging on the target]
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Mitigating  DDoS  at  ISP  level 


• Mitigation can be effective when implemented on ISP and/or core routers

• This requires

- high-speed traffic analysis

- information aggregation from various sources

[Figure: client networks reach the target network through the Internet backbone; annotations "1,000-100,000" and "128 kb"]
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System  Overview 
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OpenIMP 
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DDoS  Detection  Metrics 


•  Some  examples 

-  Packet  Count  (above) 

-  Byte  Count 

-  Packet  count  per  flow  /  flag  /  message  type 

•  Transformations 

-  CUSUM  (below) 

-  Wavelet 

-  Entropy 

•  A  multitude  of  proposals  in  different  papers! 

•  Which  ones  to  implement? 
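
Two of the transformations named above, CUSUM and entropy, can be sketched in a few lines. This is a minimal illustration and not the OpenIMP implementation: the function names, the `target` and `slack` parameters, and the sample values are assumptions made for the example.

```python
import math
from collections import Counter


def cusum(samples, target, slack):
    """One-sided CUSUM: accumulate deviations above target + slack, never below zero."""
    score = 0.0
    scores = []
    for x in samples:
        score = max(0.0, score + (x - target - slack))
        scores.append(score)
    return scores


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Packet counts per interval: the score stays at zero until the surge starts.
print(cusum([10, 10, 10, 50, 50], target=10, slack=5))
# Entropy of observed destination ports: low entropy means concentrated traffic.
print(shannon_entropy([80, 80, 443, 443]))
```

A detector would raise an alarm once the CUSUM score crosses a threshold chosen during training; that threshold is not given in the slides.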


FOKUS 
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Latent  Semantic  Indexing 


• reduces a multi-dimensional feature vector to a lower-dimensional feature vector (easier to process)

• information preserving (principal components)

• maps all metrics into one uniformly sized feature vector


[Figure: (metric 1, metric 2, metric 3, ..., metric N) x LSI(k) = (index a, index b, ..., index k)]
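
LSI is commonly computed with a truncated singular value decomposition. The sketch below shows the idea in plain Python using power iteration on the Gram matrix; it is an illustration under stated assumptions, not the presenters' code. The function names and the sample matrix are hypothetical, and a production system would use a numerical library.

```python
import math


def top_k_directions(rows, k, iters=200):
    """Approximate the k leading right singular vectors of `rows` by power iteration."""
    n = len(rows[0])
    # Gram matrix G = A^T A (n x n), where A holds one metric vector per row.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    directions = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iters):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            # Remove components along directions that were already found.
            for d in directions:
                dot = sum(a * b for a, b in zip(w, d))
                w = [a - dot * b for a, b in zip(w, d)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm < 1e-12:
                break
            v = [a / norm for a in w]
        directions.append(v)
    return directions


def project(rows, directions):
    """Map every N-dimensional metric vector to a k-dimensional index vector."""
    return [[sum(a * b for a, b in zip(r, d)) for d in directions] for r in rows]


metrics = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # three observations, N = 2 metrics
reduced = project(metrics, top_k_directions(metrics, k=1))
print(reduced)
```

Whatever the number of input metrics N, the output vectors always have length k, which is what lets the cluster detection stage stay unchanged when metrics are added.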
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Index  b 


Cluster  Detection 
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Cluster  Detection 


•  Unknown  Clusters  are  a  possible  threat 

•  Reactions  include 

-  Filtering,  if  bandwidth  is  scarce  anyway 

-  Detailed  analysis  of  identified  anomalies 


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What  it  looks  like... 


[Figure: time series from 14:00 to 14:20 showing octetDeltaCount and deltaPacketCountCusum, with the cluster detection states "no attack detected" and "Stop detailed analysis"]

Available metrics (legend): activeFlows, deltaPacketCountMean, deltaPacketCountDev, packetLengthMean, deltaOctetCountCusum, packetDeltaCount, deltaOctetCountMean, deltaOctetCountDev, taskId, deltaPacketCountCusum, octetDeltaCount, tstamp.sec, deltaTCPPacketCount, deltaUDPOctetCount, deltaTCPOctetCount, deltaUDPPacketCount, deltaICMPOctetCount
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The  advantage  of  using  IPFIX 


•  Established  standard  for  network  metrics 

•  New  probes/metrics  can  be  added  into  the  system 

-  They  immediately  speak  the  language  of  the  system 

-  Standard  components  (routers)  may  provide  the  data 

-  A  training  phase  is  needed  for  new  information  sources 

•  Latent  Semantic  Indexing  reduces  any  number  of  metrics 

•  Cluster  Detection  operates  on  the  same  feature  space  size 

•  Detection  seamlessly  integrates  new  IPFIX  information  sources 


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Thank  You! 


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Questions? 
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A  Software  Tool  for  Multi-Field  Multi-Level 

NetFlows  Anonymization 

<http://scrub-netflows.sourceforge.net/> 

William  Yurcik 

Clay  Woolam,  Latifur  Khan,  Bhavani  Thuraisingham 

University  of  Texas  at  Dallas 


Anonymization? 

Anonymization  enables  entities  to  share  types  of  data 
that  would  otherwise  not  be  shared 

(1)  Private  Data 

-  User-identifiable  information 

•  user  content  (Email  messages,  URLs) 

•  user  behavior  (access  patterns,  application  usage) 

-  Machine/Interface  addresses 

•  IP  and  MAC  addresses 

(2)  Secret  Data 

- System configurations (services, topology, routing)

- Traffic patterns (connections, mix, volume)

- Security defenses (firewalls, IDS, routers)

- Attack impacts
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Sharing? 


•  Chasing  attackers  away  (to  other  organizations) 
does  not  improve  security 


•  Security  data  is  needed  between  organizations  to 
correlate  events  across  administrative  domains 
(cumulative  learning  between  organizations) 

-  Detect  attacks 

-  Blacklist  attackers  and  attacker  techniques 

- Distinguish between normal and suspicious network traffic patterns
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SCRUB*  Infrastructure 


SCRUB 
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CANINE  (Flocon’05) 
a  NetFlows  Converter/Anonymizer 


•  CANINE:  Converter  and  ANonymizer  for  Investigating 
Netflow  Events 

<http://security.ncsa.uiuc.edu/distribution/CanineDownLoad.html> 

•  Converter 

-  Cisco  V5  &  V7,  ArgusNCSA,  CiscoNCSA,  NFDump 

•  Anonymizer 

-  5  NetFlow  fields  (multi-field) 

(1)  IP,  (2)  Timestamp,  (3)  Port,  (4)  Protocol,  (5)  Byte  Count 

-  Multiple  options  for  each  field  (multi-level  anonymization) 

•  Java  GUI  -  easy  to  use  point-and-click 



IP  Address  Anonymization  in  CANINE 
SCRUB-NetFlows (Flocon’08)

New  &  Improved  NetFlows  Anonymizer 

•  ASCII-based  PERL  code 

-  works  on  any  NetFlows  format  converted  to  ascii 

-  optimized  code  (multi-threaded  parallelization) 

•  Anonymizes  more  NetFlow  fields  (10>5) 

-  adding  support  for  additional  fields  is  minimal 

-  (6)  TimeStamp  (first/last  pkt)  (7)  TOS  (8)  TTL  (9)  TCP  Flags  (10)  Packet  Count 

•  Improved/More  anonymization  options  per  field 

-  Fixes  Crypto-PAn  IP  address  anonymization  flaw 

-  Working  on  tailoring  semantics  to  low/medium/high 

• Command line operation

- UNIX friendly, consistent with other SCRUB* tools

- cascaded streaming operation available via piping
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SCRUB-NetFlows 

Multi-Level  Anonymization  Options 

•  Black  Marker  (filtering/deletion) 

•  Pure  Randomization  (replacement) 

•  Keyed  Randomization  (replacement) 

•  Annihilation/Truncation  (accuracy  reduction) 

•  Prefix-Preserving  Pseudonymization  (IP  address) 

•  Grouping  (accuracy  reduction) 

-  Bilateral  Classification 

•  Enumeration  (time,  adding  noise) 

•  Time  Shift  (time,  adding  noise) 


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Example:  Timestamp  Field  (First/Last  Pkt) 

•  Black  Marker 

-  replacement  of  field  with  a  predefined  constant  (0) 

•  Random  Time  Shift 

-  increments  given  time  by  a  random  value  within  a 
user  defined  window 

•  Enumeration 

-  sorts  entries  by  timestamp,  applies  black-marker 

•  Distance-preserving  pseudonymization 

-  preserve  distance  between  two  timestamps 

•  More 

-  including  pure/keyed  randomization,  truncation,  unit 
annihilation 
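
The timestamp options above can be sketched as small functions. This is an illustration only and not the SCRUB-NetFlows Perl code: the function names are hypothetical, enumeration is interpreted here as replacing each timestamp by its rank in sorted order, and distance preservation is shown as a single constant offset, both of which are assumptions.

```python
import random


def black_marker(timestamps, constant=0):
    """Replace every timestamp with a predefined constant."""
    return [constant for _ in timestamps]


def random_time_shift(timestamps, window, rng):
    """Increment each timestamp by a random value in [0, window]."""
    return [t + rng.randint(0, window) for t in timestamps]


def enumerate_timestamps(timestamps):
    """Keep only the ordering: each timestamp becomes its rank among distinct values."""
    rank = {t: i for i, t in enumerate(sorted(set(timestamps)))}
    return [rank[t] for t in timestamps]


def distance_preserving(timestamps, offset):
    """Shift all timestamps by the same secret offset, so differences are unchanged."""
    return [t + offset for t in timestamps]


first_packet_times = [300, 100, 200, 100]
print(black_marker(first_packet_times))
print(random_time_shift(first_packet_times, window=30, rng=random.Random(1)))
print(enumerate_timestamps(first_packet_times))
print(distance_preserving(first_packet_times, offset=-100))
```

Each option trades a different amount of analytic value for privacy: the black marker removes all timing, while the constant offset keeps every interval intact.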


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Addressing  Crypto-PAn  Flaw 
in  SCRUB-NetFlows 


• Crypto-PAn is widely used for prefix-preserving pseudonymization

- flaw discovered: an attacker can reverse-engineer the original prefix mapping in a given dataset

• Our use of Crypto-PAn

- Begin with two separate instances of Crypto-PAn with two distinct keys: Crypt1 and Crypt2

- Determine the network and host portions of the IP address

- Run Crypt1 and Crypt2 on the IP address

- Return the network portion given by Crypt1 concatenated with the host portion given by Crypt2
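
The two-key construction can be sketched as follows. Real Crypto-PAn is built on AES; the `prefix_preserving` function here is a simplified stand-in based on HMAC-SHA256, used only to make the example self-contained. The function names, the keys, and the fixed prefix length are assumptions for illustration.

```python
import hashlib
import hmac
import ipaddress


def prefix_preserving(ip: int, key: bytes) -> int:
    """Toy prefix-preserving map: output bit i is input bit i XOR a keyed
    pseudo-random bit that depends only on the i higher-order input bits."""
    out = 0
    for i in range(32):
        prefix = ip >> (32 - i)  # the i most significant bits (0 when i == 0)
        digest = hmac.new(key, f"{i}:{prefix}".encode(), hashlib.sha256).digest()
        flip = digest[0] & 1
        bit = (ip >> (31 - i)) & 1
        out = (out << 1) | (bit ^ flip)
    return out


def two_key_anonymize(addr: str, prefix_len: int, key1: bytes, key2: bytes) -> str:
    """Network portion from the first instance, host portion from the second."""
    ip = int(ipaddress.IPv4Address(addr))
    host_mask = (1 << (32 - prefix_len)) - 1
    net_part = prefix_preserving(ip, key1) & ~host_mask & 0xFFFFFFFF
    host_part = prefix_preserving(ip, key2) & host_mask
    return str(ipaddress.IPv4Address(net_part | host_part))


print(two_key_anonymize("192.168.1.10", 24, b"key-one", b"key-two"))
print(two_key_anonymize("192.168.1.77", 24, b"key-one", b"key-two"))
```

Two addresses in the same /24 still share an anonymized network portion, but the host bits come from an independent key, so the prefix relationship no longer extends across the network/host boundary.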


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Example  usage 

•  Anonymizations  done  on  one  line  of  an  Argus 
NetFlow 

-  The  program  is  told  to  black  marker  the  source  IP, 
randomize  the  destination  IP,  and  black  marker  the 
first  timestamp 

$ ./scrub-netflow.pl -r ArgusData_146_78 -w AnonData -o "srcip bin dstip rand firsttimestamp bin"
Anonymizing ARGUS format
$ tail -n 1 AnonData
01 Jan 71 01:01:01 02 Oct 03 14:06:50 udp 10.10.10.11.1118 -> 39.7.114.87.55525 6 0 4856 0 INT
$ tail -n 1 ArgusData_146_78
02 Oct 03 14:00:00 02 Oct 03 14:06:50 udp 132.156.189.139.1118 -> 228.154.76.120.55525 6 3 4856 0 INT
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Anonymization  for  Sharing: 
The  Privacy  vs.  Analysis  Tradeoff 


• While anonymization protects against information leakage, it also destroys data needed for security analysis

- Zero-sum? (more privacy <> less analysis & vice versa)

- We are now making measurements of the tradeoff

• another story, but we can talk off-line
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Summary 

•  Critical  need  for  security  data  sharing  between  organizations 

•  Anonymization  can  provide  safe  security  data  sharing 

-  Multi-Field:  prevent  information  leakage 

-  Multi-Level:  no  one-size-fits-all  anonymization  solution 

•  SCRUB-NetFlows  as  part  of  a  data  sharing  infrastructure 
(SCRUB*)  supporting  multiple  data  sources 

-  NetFlows  is  not  the  only  data  source  of  interest 

• No “One-Size-Fits-All” anonymization policy

- multi-level anonymization options can/should be tailored to the requirements of the sharing parties to optimize tradeoffs

- privacy/analysis anonymization tradeoffs need to be characterized
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scanners. 
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Objectives 

Analyze  year  long  trace  collected  using  SiLK 
tools  from  a  122  enterprise  network  for 
scanning  activity. 

Use  Threshold  Random  Walk  (TRW),  one  of 
the  most  effective  algorithms  for  early  scan 
detection,  to  detect  the  scanning  activity. 

Find out whether using Bloom filters together with TRW, sequentially, can detect the scanners that went undetected using only TRW.

And we shall soon be enlightened ....
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Rationale 


TRW  is  very  effective,  but  has  some  problems: 

-  In  case  of  slow  or  stealthy  scanning? 

-  In  case  of  UDP  or  ICMP? 

-  In  case  of  repetitive  scanning? 

Use a Bloom filter to eliminate repetitive input to TRW and look for reverse matches in time-ordered data.

- Can we detect the slow scans?

- Can we detect UDP and ICMP scans?

- Can we score ICMP responses to non-ICMP traffic?

Let's see ....
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Threshold  Random  Walk 


Scan  Detection  Algorithm  based  on 
sequential  hypothesis  testing. 

Uses  a  positive  reward  based  scan  detection. 

-  For  a  given  host,  keeps  a  ratio  which 

•  In  case  of  successful  connection,  is  decreased 

•  In  case  of  unsuccessful  connection,  is  increased. 

-  This  ratio  is  compared  with  two  thresholds 

•  If  it  goes  above  one,  then  it’s  a  scanner 

•  If  it  goes  below  the  other,  then  it’s  benign 

•  If  it  goes  neither  way,  i.e.,  is  in  between  the  two 
thresholds,  then  can’t  say 
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Threshold  Random  Walk 


The ratio is calculated as:

$$\Lambda(Y) = \frac{\Pr[Y \mid H_1]}{\Pr[Y \mid H_0]} = \prod_{i=1}^{n} \frac{\Pr[Y_i \mid H_1]}{\Pr[Y_i \mid H_0]}$$

Where the probabilities are:

$$\Pr[Y_i = 0 \mid H_0] = \theta_0, \qquad \Pr[Y_i = 1 \mid H_0] = 1 - \theta_0$$
$$\Pr[Y_i = 0 \mid H_1] = \theta_1, \qquad \Pr[Y_i = 1 \mid H_1] = 1 - \theta_1$$

- $Y_i$ = success (0) or failed (1) connection attempt

- $H_0$ = benign hypothesis

- $H_1$ = scanner hypothesis

- $\theta_0$ = probability that a connection attempt succeeds, given that the source is benign

- $\theta_1$ = probability that a connection attempt succeeds, given that the source is a scanner

The thresholds are calculated based on

- desired true positive rate ($\beta = 0.99$)

- desired false positive rate ($\alpha = 0.01$)

$$\eta_1 = \frac{\beta}{\alpha}, \qquad \eta_0 = \frac{1 - \beta}{1 - \alpha}$$

DALHOUSIE
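
The sequential test can be sketched as a short loop. This is a minimal illustration, not the presenters' code: it assumes the standard sequential hypothesis testing thresholds, upper = beta/alpha and lower = (1 - beta)/(1 - alpha), and uses theta0 = 0.7 and theta1 = 0.3 as default values. The function name and return labels are hypothetical.

```python
def trw_classify(outcomes, theta0=0.7, theta1=0.3, alpha=0.01, beta=0.99):
    """Classify one source from its connection outcomes (0 = success, 1 = failure)."""
    upper = beta / alpha              # crossing this threshold means scanner
    lower = (1 - beta) / (1 - alpha)  # crossing this threshold means benign
    ratio = 1.0
    for y in outcomes:
        if y == 0:
            ratio *= theta1 / theta0              # success lowers the ratio
        else:
            ratio *= (1 - theta1) / (1 - theta0)  # failure raises the ratio
        if ratio >= upper:
            return "scanner"
        if ratio <= lower:
            return "benign"
    return "can't say"


print(trw_classify([1, 1, 1, 1, 1, 1]))  # six failed attempts in a row
print(trw_classify([0, 1, 0, 1]))        # mixed outcomes
```

With these parameters each failure multiplies the ratio by about 2.33, so six consecutive failures are needed to cross the upper threshold of 99; a slow scanner that interleaves successes can stay undecided for a long time.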
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Bloom  Filter 


Data structure used to test the membership of an element in a given set. Uses a bit array to record multiple hash values per element.

Basic properties of these filters are:

- False positives are possible, but no false negatives.

- Elements can be added to the set, but cannot be removed.

- The higher the percentage of set bits, the higher the probability of false positives.

- Space efficient compared to other set membership testing methods.

- Cannot be reverse engineered to find the set of elements present in it.

[Figure: element a hashed to several positions of the bit array]
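
A Bloom filter with the properties listed above can be sketched as follows. The class name, the default sizes, and the use of SHA-256 with a per-hash index prefix are assumptions chosen to keep the example self-contained; they are not taken from the presenters' implementation.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; supports add and membership test only."""

    def __init__(self, num_bits=1 << 16, num_hashes=10):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by hashing the item with a different index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


seen = BloomFilter()
seen.add("10.0.0.1|10.0.0.2")
print("10.0.0.1|10.0.0.2" in seen)
print("10.0.0.9|10.0.0.2" in seen)
```

There is no remove operation because clearing a bit could also erase another element that hashed to the same position.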
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TRW  +  Bloom 

TRW hit or miss definition modified

- For a given tuple in the flow record, e.g., {sip, dip}

• HIT = a corresponding response entry {dip, sip} is found within a specified timeout period

• MISS = no corresponding entry {dip, sip} is found within a specified timeout period

The Bloom filter uses 10 hash functions and a bit vector of size 2^32

Simple setup:

- Pass the flow records through the Bloom filter to get unique entries for a given specified tuple

- Different tuple combinations used are {sip, dip}, {sip, dip, proto}, {sip, dip, sport}, {sip, dip, dport}, {sip, dip, sport, dport, proto}

- Then analyze the output using the TRW scan detection algorithm.
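
The two stages of the setup can be sketched as below. To keep the example short, a plain Python set stands in for the Bloom filter, and flow records are dictionaries with hypothetical keys `time`, `sip`, and `dip`. The function names are assumptions, and the reverse-match search is a simple quadratic scan rather than an efficient implementation.

```python
def unique_tuples(flows, fields):
    """Pass flows through a membership filter and keep the first record per tuple."""
    seen = set()  # stand-in for the Bloom filter
    for flow in flows:
        key = tuple(flow[f] for f in fields)
        if key not in seen:
            seen.add(key)
            yield flow


def label_hits(flows, timeout=10):
    """Label each time-ordered flow HIT if a reverse {dip, sip} entry follows within timeout."""
    labels = []
    for i, f in enumerate(flows):
        hit = any(
            g["sip"] == f["dip"]
            and g["dip"] == f["sip"]
            and 0 <= g["time"] - f["time"] <= timeout
            for g in flows[i + 1:]
        )
        labels.append((f["sip"], f["dip"], "HIT" if hit else "MISS"))
    return labels


flows = [
    {"time": 0, "sip": "A", "dip": "B"},
    {"time": 0, "sip": "A", "dip": "B"},  # repeat, dropped by the filter
    {"time": 3, "sip": "B", "dip": "A"},  # response to A
    {"time": 5, "sip": "A", "dip": "C"},  # never answered
]
print(label_hits(list(unique_tuples(flows, ("sip", "dip")))))
```

Note that this simple version also labels the response flow itself; a fuller implementation would need a rule for which direction counts as the connection attempt.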
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The  Dataset 
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Plot  shows  flows  per  number  of  IP  addresses. 
How  easy  it  is  to  defeat  TRW  in  this  network? 
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Issues  to  keep  in  mind 


Number of packets per flow record >= 1

Time granularity is one second; milliseconds are not available.

For packets received in the same second, the outside-to-inside flow record always appears first, irrespective of the actual order.

Background noise in the traffic.

ICMP ping traffic causes false detection.
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Preliminary  Results 

Used theta0=0.7 and theta1=0.3 for the TRW algorithm.

Kept a timeout period of 10 sec.

Max number of false positives due to ICMP

- Is this a ping tunnel ???

- Lots of outside IPs contacting a single inside IP with lots of ping requests and getting responses, effectively lots of bytes being transferred.

The SD and SDP options using Bloom detect horizontal scans.

The SDSP and SDDP options detect vertical scans.

The SDSDP covers both.

Option         Count
All            53
All_no_icmp    14
All_SD         3
All_SDP        3
All_SDSP       16
All_SDDP       4
All_SDSDP      5
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Likelihood  Ratio  LogScale 


Preliminary  Results 
Plot of likelihood ratio for scanners


Time 
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Likelihood  Ratio  LogScale 


Preliminary  Results 

Plot of likelihood ratio for “Can’t Say”s
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Likelihood  Ratio  LogScale 


Preliminary  Results 
Plot of likelihood ratio for benign sources
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Initial  Conclusions 


The Bloom filter somewhat reduces the false positives

- only unique entries for the given filter criteria are considered by TRW.

Using specific options for the Bloom filter, it is faster to detect vertical or horizontal scanning

Need to improve the technique by

- Checking how changes in the theta0 and theta1 values affect the overall results.

- Checking real-time scenarios.

Some IPs go to Scanner, then return to Can’t Say.

Still more data is left to be analysed (in progress)

Certain issues mentioned earlier need to be taken care of, e.g., dealing with the number of packets per flow
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Network  Analysis  of  Point  of 
Sale  System  Compromises 


Operation  Terminal  Guidance 

Chicago  Electronic  &  Financial  Crimes 

Task  Force 

U.S.  Secret  Service 


•  Hypothesis 

•  Deployment  Methodology 

•  Data  Analysis 

•  Findings 

•  Discussion 


Hypothesis:  Remote  attackers  were  not 
targeting  point  of  sale  (POS)  system 
software. 


-  The  underlying  operating  system  and  installed 
applications  are  not  deployed  in  accordance 
with  Payment  Card  Industry  Data  Security 
Standard 

-  POS  system  compromises  are  a  result  of 
automated  scanning  and  vulnerability 
exploitation 
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Association  rules 

-  Clustering 

• T: Number of virtual POS systems with connection attempts from a single source

• n_i: Number of packets from a source to virtual POS system i

• N: Total number of packets from a source to all three POS systems

• N = Σ n_i

Support(R) = # connections (POS system A, B, and C)
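
The item-set quantities defined above can be computed directly from a packet log. The sketch below is an illustration under assumptions: the input is a list of hypothetical (source, POS system) pairs, and support is computed as the share of sources that exhibit the same (T, N) item set, which is one plausible reading of the slide rather than the authors' exact definition.

```python
from collections import Counter, defaultdict


def source_item_sets(packets):
    """Return {source: (T, N, n)} where n maps each POS system to its packet count."""
    per_source = defaultdict(lambda: defaultdict(int))
    for source, pos_system in packets:
        per_source[source][pos_system] += 1
    result = {}
    for source, counts in per_source.items():
        n = dict(counts)
        result[source] = (len(n), sum(n.values()), n)  # T, N = sum of n_i, n_i
    return result


def support(item_sets):
    """Share of sources that exhibit each (T, N) item set."""
    tally = Counter((t, n_total) for t, n_total, _ in item_sets.values())
    total = len(item_sets)
    return {item_set: count / total for item_set, count in tally.items()}


packets = [("s1", "A"), ("s1", "A"), ("s1", "B"), ("s2", "C")]
sets = source_item_sets(packets)
print(sets)
print(support(sets))
```

Sources with identical (T, N) signatures on the same port are candidates for having been generated by the same automated tool.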
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Data  analysis  methodology  from 
F. Pouget and M. Dacier. “Honeypot Based Forensics.”


Data  Analysis 
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Control  Group  Clusters 


Port   Item Sets                          Support %   Support % > 1%
80     Cluster 1: T=1, N=3                43.5%       1
       Cluster 2: T=1, N=1                10.9%
       Cluster 3: T=2, N=8 (n=5, n=3)     4.3%
135    Cluster 4: T=1, N=1                54.5%       2
       Cluster 5: T=1, N=2                22%
139    Cluster 6: T=1, N=2                75%         1
       Cluster 7: T=1, N=3                10.1%
445    Cluster 8: T=1, N=1                20%         2
       Cluster 9: T=1, N=2                70%
       Cluster 10: T=1, N=3               7.1%
1026   Cluster 11: T=1, N=1               53.5%       1
1027   Cluster 12: T=1, N=1               98%         1
1028   Cluster 13: T=1, N=1               83%         1
5901   Cluster 14: T=1, N=2               90.9%       1

Port   Item Sets                                    Support %   Support % > 1%
445    Cluster 1: T=2, N=34                         22.2%
1026   Cluster 2: T=2, N=3                          1.8%
       Cluster 3: T=3, N=3 (n=1, n=1, n=1)          20%
       Cluster 4: T=1, N=1                          50.9%
1394   Cluster 5: T=1, N=12                         20%
       Cluster 6: T=1, N=15                         16.7%
       Cluster 7: T=1, N=6                          1.7%
       Cluster 8: T=1, N=9                          16.7%
2967   Cluster 9: T=3, N=8 (n=2, n=3, n=3)          10%
       Cluster 10: T=3, N=30 (n=10, n=10, n=10)     10%
5900   Cluster 11: T=3, N=3                         20%


Data Analysis


Edit  Distance  Analysis 

-  Extract  TCP  payloads 
from  previous  identified 
cluster  members 

-  Compare  packets  from 
each  IP  address 
against  all  others 
identified  through 
clustering 
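
The payload comparison can be illustrated with the classic Levenshtein edit distance. This is a generic sketch, not the tool used in the investigation: the function names and sample payloads are hypothetical, and the tables that follow report distances in lines rather than bytes.

```python
def edit_distance(a: bytes, b: bytes) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete from a
                cur[j - 1] + 1,            # insert into a
                prev[j - 1] + (ca != cb),  # substitute, free if the bytes match
            ))
        prev = cur
    return prev[-1]


def pairwise_distances(payloads):
    """Compare the payload of every source against all others in the cluster."""
    sources = sorted(payloads)
    return {
        (s, t): edit_distance(payloads[s], payloads[t])
        for i, s in enumerate(sources)
        for t in sources[i + 1:]
    }


payloads = {"source_a": b"GET /a", "source_b": b"GET /b", "source_c": b"POST /login"}
print(pairwise_distances(payloads))
```

Small distances between payloads from different source addresses suggest the same exploit code, which supports the hypothesis of automated attacks.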


[Figure: side-by-side TCP payload dumps from Source A and Source B]

Attack  Phrases 


Cluster      Port   Phrase Distance (Lines)   Std Deviation
Cluster 6    139    2                         9
Cluster 7    139    1                         5
Cluster 8    445    3                         10
Cluster 9    445    5                         8
Cluster 10   445    4                         18
Cluster 11   1026   86                        169
Cluster 13   1028   12                        65
Cluster 14   5901   32                        12

***Clusters 1, 2, 3, 4, 5, and 12 were discarded as not statistically significant


Cluster      Port   Phrase Distance (Lines)   Std Deviation
Cluster 2    1026   324                       238
Cluster 5    1394   360                       85
Cluster 6    1394   280                       170
Cluster 7    1394   529                       136
Cluster 8    1394   1422                      1143
Cluster 11   5900   240                       257

***Clusters 1, 3, 4, 9, 10 were discarded as not statistically significant

[Figure: tree map of packet header fields: Packet Length, Ethertype, IP Version, IP Header Length, Differential Services, Total Length, Fragment, IP Transport Protocol, IP Header Checksum, IP Source Address, IP Destination Address, TCP Source Port, TCP Destination Port, Seq Number, UDP Source Port, UDP Destination Port]

Methodology from Greg Conti’s “Security Data Visualization”


Data Analysis

The TCP outlier is associated with browsing a public web site to ensure connectivity

Uniform length of packets


Data Analysis

• Examination of the UDP packets identified in the previous tree map revealed them to be spam targeting messenger applications
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•  Automated  scanning  of  select  set  of  ports 


•  Multiple  exploits  targeting  multiple  OS’s 
from  single  source  IP  address 

•  Attackers  not  aware  compromised  system 
is  a  POS  system  until  after  compromise 
and  exploit 

• Insecure installation of operating system and applications leads to compromise
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Discussion 
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Overview 

The  balance  of  anonymization 


Subnet-preserving 

Subnet-collapsing 

Host-preserving 

Host-collapsing 

Ports  &  Other  issues 

Conclusion 
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The  Balance  of  Anonymization 


Flow  itself  preserves  some  privacy 
by  aggregation  and  eliding  content. 

Anonymization  is  to  aid  in  preserving 
the  privacy  of  organizations 
represented  in  the  data 

•  Data  owner 

•  Partner  or  Customer 

•  Incidental 

•  Attacker 

The more you anonymize the data, the fewer analyses can be done with it.

Need to explore a range of options

[Figure: balance between privacy and utility]

CERT, Software Engineering Institute, Carnegie Mellon
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Subnet  Preserving 


Preserve  host  identity  while 
concealing  network. 

How: 

•  Prepare  list  of  networks 

•  Assign  random  substitution  for 
network  prefix 

•  Mask  and  replace  prefix  on  each 
address 

•  Associative  array  works  well  for 
substitutions 

Balance: 

•  Enables  analysis  down  to  host 
identity,  but  not  organization 
identity 

•  Can  be  reversed  by  outside 
knowledge  (server  suffixes) 
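
The subnet-preserving steps can be sketched with a dictionary as the associative array. This is an illustration under assumptions: the function name is hypothetical, a fixed /16 prefix replaces the prepared list of networks, and substitute prefixes are drawn at random without reuse.

```python
import ipaddress
import random


def subnet_preserving(addresses, prefix_len=16, seed=None):
    """Replace the network prefix with a random substitute and keep the host bits."""
    rng = random.Random(seed)
    host_bits = 32 - prefix_len
    substitutions = {}  # associative array: original prefix -> substitute prefix
    used = set()
    anonymized = []
    for addr in addresses:
        ip = int(ipaddress.IPv4Address(addr))
        prefix, host = ip >> host_bits, ip & ((1 << host_bits) - 1)
        if prefix not in substitutions:
            candidate = rng.getrandbits(prefix_len)
            while candidate in used:  # keep the mapping one-to-one
                candidate = rng.getrandbits(prefix_len)
            used.add(candidate)
            substitutions[prefix] = candidate
        new_ip = (substitutions[prefix] << host_bits) | host
        anonymized.append(str(ipaddress.IPv4Address(new_ip)))
    return anonymized


print(subnet_preserving(["240.204.5.3", "240.204.9.9", "10.1.5.3"], seed=7))
```

Because the host suffix survives unchanged, an observer who knows typical server suffixes in a network can still recognize it, which is the reversal risk noted above.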


Example: 240.204.5.3 -> 010.005.5.3


Subnet  Collapsing 


Conceal  network  structure  and  host 
identity,  but  preserve  commonality  of 
network 

How: 

• Reduce all addresses to the network

•  Prepare  random  substitution  for 
network 

•  Replace  address  with  network 
substitutions 

Balance: 

•  Allows  network-level  behavior 
analysis 

•  Might  be  reversed  by 
organizations  with  lots  of  contact 
with  data  source 


Example: 240.204.0.0
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Host  Preserving 

Preserve  host  identity  while 
concealing  network 
commonality 

How: 

•  Generate  list  of  addresses 

•  Generate  random  substitution 
for  each  address 

•  Replace  each  occurrence  with 
same  substitution 

Balance: 

•  Allows  host-specific  analysis 

•  Difficult  to  reverse 
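
Host-preserving substitution differs from the subnet variants in that the whole address is the lookup key. A minimal sketch, with a hypothetical function name and a seeded random generator as assumptions:

```python
import ipaddress
import random


def host_preserving(addresses, seed=None):
    """Give every distinct address one random substitute and reuse it consistently."""
    rng = random.Random(seed)
    substitutions = {}  # original address -> substitute address
    used = set()
    anonymized = []
    for addr in addresses:
        if addr not in substitutions:
            candidate = rng.getrandbits(32)
            while candidate in used:  # two hosts must never share a substitute
                candidate = rng.getrandbits(32)
            used.add(candidate)
            substitutions[addr] = str(ipaddress.IPv4Address(candidate))
        anonymized.append(substitutions[addr])
    return anonymized


print(host_preserving(["240.204.5.3", "240.204.5.12", "240.204.5.3"], seed=3))
```

Two hosts from the same network receive unrelated substitutes, so per-host behavior can still be followed while network commonality is concealed.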


Example addresses: 246.204.5.3, 10.2.3.9, 240.204.5.12, 192.168.12.7
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Host  Randomizing 


Do not preserve host or network identity (i.e., remove any useful address content)

How: 


•  Replace  each  occurrence  of 
each  address  with  random 
value 

•  Allow  repetition  of  random 
values 

Balance: 

•  Only  permit  analysis  that  does 
not  involve  address 
information 

•  Extremely  difficult  to  reverse 


Example:
248.204.5.3 -> 128.0.3.2
10.2.3.7 -> 192.168.7.12
248.204.5.3 -> 0.5.4.1
192.168.17.37 -> 10.2.3.7
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Ports  and  Other  Issues 


There’s  more  to  anonymization  of  flow  than 
addresses 

•  Network  ports  can  be  very  revealing  (OS  fingerprinting) 

•  Timing  information  might  be  revealing 

•  TCP  flags  might  be  revealing  (odd  patterns) 


Can  anonymize  this  information: 

• Ports: reduce to service; substitute; reduce to common/reserved/dynamic

• Timing: restart epoch; rescale timing; collapse interval

• TCP flags: reduce to function; remove OS-dependencies
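
Two of these reductions are simple enough to sketch. The port classes below follow the IANA ranges (0-1023, 1024-49151, 49152-65535); how they map onto the slide's "common/reserved/dynamic" labels is an assumption, as are the function names.

```python
def port_class(port: int) -> str:
    """Reduce a port number to its IANA range."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"


def restart_epoch(timestamps):
    """Shift timestamps so the earliest one becomes zero."""
    start = min(timestamps)
    return [t - start for t in timestamps]


print([port_class(p) for p in (80, 8080, 55525)])
print(restart_epoch([1000, 1060, 1005]))
```

Reducing ports to a class hides which service or ephemeral range a host uses, at the cost of losing per-service analysis.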




Conclusion 

Data  sharing  is  difficult 


Anonymization  can  be  useful,  but  limiting 
Anonymized  does  not  mean  private  or  irreversible 


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Outline 

Problem  statement  and  objectives 

■  Adapting  flow  information  granularity 

Increasing  granularity  with  zoom  monitors 

Decreasing  granularity  with  relevance-sensitive  compression 

■  Implementation 

■  Results 

Conclusion  and  outlook 
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Problem  Statement 

■  Trade-off  in  network  traffic  information  collection  for  incident  analysis 

Raw  packet  traces:  finest  level  of  detail  but  impractical  to  manage  and  search 
Flow  traces:  high-level  traffic  abstraction  but  aggregated 

■  Traditional  flow  exports  may  not  provide  traffic  details  required 

to  understand  causes  of  incidents 

Sampling  on  metering  device 

Aggregated  IP  addresses  (prefixes)  or  AS  level  information  in  exports 
Missing  layer  2  and  layer  3  header  information 
No  packet  content  information 

Flow-level information is often redundant for incident analysis

Limited  additional  value  on  the  flow  level  when  given  a  set  of  prior  traffic  observations 
Sequences  of  similar  flows  (streams,  remote  sessions,  web/mail  traffic,  file  transfers) 

Flow  record  collections  are  still  tedious  to  search,  store,  and  analyze 
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Objectives  and  Goals 

Extend  a  collector  system  to  enable  more  accurate  incident  analysis 
Adapt  information  granularity  depending  on  relevance  of  the  traffic: 

Focus  in  on  particular  traffic  events  to  obtain  more  details 

Compress  known/less  relevant  traffic  events  (conserve  a  meaningful  abstraction) 


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Traffic  Collection  for  Incident  Analysis 


After-the-fact analysis
[Figure: Reproduce event trail -> Understand/infer causes -> Conclude, with a "Refine assumptions" loop]

Real-time analysis
[Figure: Initial guess (inferred from monitoring system output) -> Collect more information -> Examine information -> Understand/infer causes -> Conclude, with a "Refine assumptions" loop]

■ Future incident trap
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Increasing  Traffic  Information  Granularity 

■  Problem 

Collecting  detailed  traffic  information  is  cumbersome 

Fixed  and  limited  amount  of  information  in  traditional  flow  exports  (e.g.,  NetFlow  5) 
Management  of  information  load  collected 


■  Traditional  approach 

Physically  attach  a  probe  or  packet  dumping  device  at  router  (e.g.,  tcpdump  with  filtering) 
Collection  of  rigid  traffic  information  (e.g.,  entire  packets) 

No  aggregation  of  data  (e.g.,  bytes),  analyze  collected  data:  manual  scripting 

■  How  to  simplify  data  collection?  Create  zoom  monitors! 

Dynamically  controlled  collection  of  relevant  traffic  information  at  desired  level  of  detail 
Centralized  management  of  data  collection  campaigns 
Make  use  of  capabilities  of  network  device  inventory  (routers,  switches) 
e.g.,  Cisco  IOS  Flexible  NetFlow 
Off-load  aggregation  and  filtering  to  network  devices 
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Zoom  Monitors 


■  Specification 

Metering  point  and  collector  device 
Zoom  monitor  lifespan 
Filter  criteria 

Traffic  aspects  to  be  exported 

■  Export  collection  and  display 




Reconfigure  metering  device  to  create  specific  exports 
Prepare  collector  device  to  store  exported  traffic  information 
Visualization  of  stored  information  (user  interface) 


■  Examples 

Show  me  the  payload  of  all  DNS  requests  of  host  10.3.4.5  in  the  next  10  minutes 
Look  for  all  internal  hosts  scanning  on  TCP  service  port  9996  (e.g.,  candidate  worm  traffic) 
Account  all  traffic  flows  using  a  particular  service  type  (e.g.,  Voice  over  IP) 

Export  unsampled  flow  measurements  from  subnet  10.9.3.1/24 
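
A zoom monitor specification could be represented as a small record with the four parts listed above. The class, its field names, and the flow dictionary keys are hypothetical; the sketch only illustrates how a specification might be matched against flows at the collector, and it does not model router configuration.

```python
from dataclasses import dataclass, field


@dataclass
class ZoomMonitor:
    """Specification of one data collection campaign."""
    metering_point: str    # device that observes the traffic
    collector: str         # device that stores the exports
    lifespan_seconds: int  # how long the monitor stays active
    filter_criteria: dict  # flow attributes that must match exactly
    export_fields: list = field(default_factory=list)  # traffic aspects to export

    def matches(self, flow: dict) -> bool:
        """True if the flow satisfies every filter criterion."""
        return all(flow.get(k) == v for k, v in self.filter_criteria.items())


# "Show me the payload of all DNS requests of host 10.3.4.5 in the next 10 minutes"
dns_monitor = ZoomMonitor(
    metering_point="router-1",
    collector="collector-1",
    lifespan_seconds=600,
    filter_criteria={"src_ip": "10.3.4.5", "dst_port": 53},
    export_fields=["payload"],
)
print(dns_monitor.matches({"src_ip": "10.3.4.5", "dst_port": 53, "proto": "udp"}))
```

Keeping the specification declarative makes it possible to manage many campaigns centrally and to translate each one into device-specific configuration.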
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Decreasing  Traffic  Information  Granularity 



■  Problem 

Most  stored  traffic  information  is  irrelevant  for  incident  analysis  (never  accessed/requested) 
Increased  storage  overhead  and  search  complexity 


■  Traditional  approaches 

Rolling database: keep all flow records up to a limit (e.g., #entries, age), then remove information
Uniform  compression:  adapt  resolution  of  flow  information  (hourly,  daily,  weekly) 

Keep  top-k  entries  (according  to  some  aspect) 

■  How  can  we  do  better? 

Gradually  compress  information  of  irrelevant  traffic  events  in  a  lossy  fashion 
With  minimal  impact  on  incident  analysis  tasks 
Summarize  similar  events  (coarse-grained  representation) 

How should the traffic be compressed?

Typical information relevance patterns

- decreasing with time after collection (normal)

- irrelevant

- non-decreasing (keep for later analysis)

- regaining relevance in posterior analysis

[Figure: relevance plotted against time after observation]


Multi-staged granularity reduction over time and with relevance

We model information relevance with a “temperature” value: “hot” for the latest events
Temperature decreases gradually over time: temperature reflects the interestingness of the data
Granularity is reduced as a function of the temperature
Temperature can be increased for abnormal events: keep fine-grained representation
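
The temperature model can be sketched as an exponential decay with an optional boost for abnormal events. The decay law, the half-life, the thresholds, and the function names are all assumptions made for illustration; the slides do not specify them. The four granularity levels are the abstraction models of the compression model (flow record, flow, conversation, session).

```python
def temperature(age_seconds, half_life=3600.0, boost=0.0):
    """Relevance in [0, 1]: halves every half_life seconds, raised by boost if abnormal."""
    return min(1.0, 0.5 ** (age_seconds / half_life) + boost)


def granularity(temp):
    """Pick the level of detail to keep for data at the given temperature."""
    if temp >= 0.5:
        return "flow record"
    if temp >= 0.25:
        return "flow"
    if temp >= 0.1:
        return "conversation"
    return "session"


for age_hours in (0, 1, 2, 4):
    t = temperature(age_hours * 3600)
    print(age_hours, round(t, 4), granularity(t))

# An abnormal event observed four hours ago keeps its fine-grained representation.
print(granularity(temperature(4 * 3600, boost=0.5)))
```

The boost term models the case where data regains relevance in later analysis, so its compression is postponed.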
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Observations 


■  Flow export problem

Multiple exports for a single connection

Examples:

Long-lived connections (streams, remote sessions, etc.)
Timeouts on routers (inactive/active timeout)
Change in service type (ToS field)

[Figure: connection timeline; legend: exported flow record, inactive/active timeouts]


■  Bi-directionality  problem 


Most  flows  have  a  reverse  counterpart  (=  redundancy) 

■  Information  similarity  problem 




Sets  of  records  with  limited  added  value  on  the  flow  level 


Examples: 

Groups  of  flows  with  similar  properties 
(Web,  mail,  printer  traffic,  polling) 

Known  short-lived  flows  (DNS  queries,  etc.) 

Typical  similarity  properties: 

<IP,port>, <IP,application>, <subnet,application>
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Compression  Model1 


Abstraction models          Flow record      Flow             Conversation     Session

Raw exports                 Yes              No               No               No
Flow definition (5-tuple)   Yes              Yes              Yes              No (subset thereof)
Direction                   Uni-directional  Uni-directional  Bi-directional   Bi-directional
# Flow records              1                > 1              > 1              > 1
# Flows                     1                1                1 or 2           > 1 or > 2
# Conversations             1                1                1                > 1


1  without  prior  knowledge  such  as  domain  or  application  specific  information 
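
The abstraction levels in the table can be illustrated with a small aggregation sketch. The record layout, the function names, and the choice to group sessions by host pair are assumptions made for this example. The 1200-second default mirrors the 20-minute session inactive timeout quoted with the results.

```python
from collections import defaultdict

# Assumed record layout:
# (src_ip, dst_ip, src_port, dst_port, proto, start, end, bytes)

def _merge(acc, key, start, end, nbytes):
    """Accumulate the time span and byte count of everything sharing a key."""
    if key in acc:
        s, e, b = acc[key]
        acc[key] = (min(s, start), max(e, end), b + nbytes)
    else:
        acc[key] = (start, end, nbytes)

def to_flows(records):
    """Flow: all flow records with the same 5-tuple (uni-directional)."""
    flows = {}
    for src, dst, sport, dport, proto, start, end, nbytes in records:
        _merge(flows, (src, dst, sport, dport, proto), start, end, nbytes)
    return flows

def to_conversations(flows):
    """Conversation: a flow merged with its reverse counterpart."""
    convs = {}
    for (src, dst, sport, dport, proto), (start, end, nbytes) in flows.items():
        a, b = (src, sport), (dst, dport)
        _merge(convs, (min(a, b), max(a, b), proto), start, end, nbytes)
    return convs

def to_sessions(convs, inactive_timeout=1200):
    """Session: conversations of one host pair separated by short gaps."""
    by_pair = defaultdict(list)
    for ((ip_a, _), (ip_b, _), _proto), span in convs.items():
        by_pair[tuple(sorted((ip_a, ip_b)))].append(span)
    sessions = []
    for pair, spans in by_pair.items():
        spans.sort()
        cur_start, cur_end, cur_bytes = spans[0]
        for start, end, nbytes in spans[1:]:
            if start - cur_end <= inactive_timeout:
                cur_end, cur_bytes = max(cur_end, end), cur_bytes + nbytes
            else:
                sessions.append((pair, cur_start, cur_end, cur_bytes))
                cur_start, cur_end, cur_bytes = start, end, nbytes
        sessions.append((pair, cur_start, cur_end, cur_bytes))
    return sessions
```

Each step discards part of the key (direction, then ports and protocol), which is where the reduction in record count comes from.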
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Implementation 

■  Metering  device  configuration  for  zoom  monitors 

Reconfiguration  of  metering  devices 
Management  console 

■  Export  collector 

Storage  and  querying 

■  Traffic  information  compression 

Aggregation  technique 
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Metering  Device  Configuration 

■  Technologies 

-  Cisco IOS Flexible NetFlow (FNF)

Configuration of multiple customized monitors
Currently: input filtering for FNF monitors not available (input filters needed at collector)

-  Hespera Traffic Meter (IBM Research)

Software-based flow monitor supporting NetFlow v5 and v9, IETF IPFIX exports
Customized flow exports (variable templates), CLI-based reconfiguration
Filtering with tcpdump syntax


■  User-based  creation  of  dynamic  zoom  monitors 

Web-based  specification  of  zoom  monitors 

Deployment  on  metering  device  (CLI-based)  and  management  (e.g.,  lifespan) 
Future:  XML-based  configuration  (cf.  [Dimitropoulos/Kind]  or  [NetConf]) 
Registering  the  zoom  monitor  at  collector  device  (for  disambiguation/triage) 
Pre-defined  zoom  monitor  templates  from  library 
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Export  Collector 

■  Prototype  based  on  the  Aurora  flow  analyzing  system  (IBM  Research) 

Replaced  existing  aggregation  database  (ADB)  with  PostgreSQL  (PG)  backend 
Input  triage  according  to  zoom  monitors 
Relevance-sensitive  compression  for  default  flow  exports 
Extension  of  the  web  user  interface 


[Figure: Aurora project architecture; labels: NetFlow/IPFIX, ADB, HTTPS, reconfiguration]
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Traffic  Information  Compression 


■  Incrementally  populate  the  databases 

Update  entries  in  databases 

Remove  entries  based  on  temperature  values  from 
finer-grained  databases 

Keep  “Session”  database 

■  Considerations 

One-by-one inserts/updates are generally slow
Prepare  entry  sets  and  use  bulk  imports 
Partitioning  and  indexing 
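
The bulk-import consideration can be illustrated as follows. The prototype described here uses PostgreSQL; this sketch substitutes Python's built-in `sqlite3` module so that it runs without a server, and the `flows` table and its columns are hypothetical. The point is only the pattern: prepare a set of entries, then insert them in one transaction instead of row by row.

```python
import sqlite3

def open_db():
    """Create an in-memory database with a hypothetical flow table and index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flows (src TEXT, dst TEXT, bytes INTEGER)")
    conn.execute("CREATE INDEX flows_src ON flows (src)")
    return conn

def bulk_insert(conn, rows):
    """Insert a prepared entry set in a single transaction."""
    with conn:  # commits once at the end, rolls back on error
        conn.executemany(
            "INSERT INTO flows (src, dst, bytes) VALUES (?, ?, ?)", rows
        )

if __name__ == "__main__":
    conn = open_db()
    batch = [(f"10.0.0.{i % 250}", "192.0.2.1", i) for i in range(10_000)]
    bulk_insert(conn, batch)
    print(conn.execute("SELECT COUNT(*) FROM flows").fetchone()[0])
```

On PostgreSQL itself the corresponding bulk path would typically be the `COPY` command rather than batched `INSERT` statements.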
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Flow  records 
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Create  New  Zoom  Monitor 
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Name 

Description

[Screenshot: web form for creating a zoom monitor, with the following sections]

Filter
  Load existing template: Destination address, Destination prefix, Empty template

Export template
  Load existing template: NetFlow 5, Empty template

Field rows shown in the form
  IPv4 Information: Destination Address
  IPv4 Transport: TCP, Destination port
  IPv4 Information: Source Address (key field)
  IPv4 Information: Protocol (key field)

Router and Interface
  Router: .zurich.ibm.com
  Interface: FastEthernet 1/0
  Direction: input

Zoom monitor lifespan
  Ad-hoc zoom monitor (Start: now, Duration: 30 sec), or specify start and end time

Metering cache
  Type: immediate
  Entries: 8192 (default)
  Active timeout: 30 min (default)
  Inactive timeout: 10 sec (default)

Flow Exporter/Collector
  Configured collector (udp:// :2095), or create new collector

Save as template / Create zoom monitor
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Filter  definition 


>  Export  information 


Router/Interface 

Lifespan 

Collector 

Cache 
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Zoom  Results:  Sessions 

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End 
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Results:  Compression  (WAN  traffic) 


Nb  of  records  in  per  bin 
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Session  inactive  timeout:  20min 


Average compression ratio

#flow records : #flows           1.26   σ = 0.07
#flow records : #conversations   2.34   σ = 0.28
#flow records : #sessions       22.80   σ = 7.00


Nb of records in DB
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Results:  Compression  (datacenter  traffic) 


Nb of records per bin

Average compression ratio

#flow records : #flows           1.19   σ = 0.02
#flow records : #conversations   2.35   σ = 0.07
#flow records : #sessions        5.39   σ = 0.46

Legend: Records to Flows, Records to Conversations, Records to Sessions


Session  inactive  timeout:  20min 
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Future  Work  and  Visions 

■  Automated  zoom  monitor  creation 

Interface  to  a  behavior-based  network  anomaly  detection  system 

Proactive  collection  of  proofs  for  posterior  forensic  analyses  of  abnormal  events 


■  Distributed  collector  infrastructure 

Distributed  collectors,  e.g.,  at  multiple  sites  (scalability) 

Transfer  required  information  to  central  reporting  system  on  demand 


■  Enhance  compression  technique 

Meta-data  representation  using  anomaly  sensor  input 
Application-sensitive  compression 


■  Cisco  IOS  Flexible  NetFlow  with  input  filters 

Perform  filtering  on  routers  to  replace  software-based  metering  (and  filtering) 
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Conclusion 

■  Incident  analysis  tool  adapting  flow  information  granularity 

Increase  level  of  detail  of  relevant/unknown  traffic  events 
Decrease  level  of  detail  (compress)  of  less  relevant  events 
Keep  a  meaningful  abstraction  of  all  traffic  events 


■  Creation  of  customized  zoom  monitors 

Zoom  in  on  specific  traffic  to  gain  additional  information  about  its  properties  and  behavior 
Centralized  management  of  metering  devices  for  traffic  detail  collection 


■  Reduced  long-term  storage  requirements 

Encouraging  test  results  with  multiple  flow  information  granularity  levels 
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Overview 

The  balance  of  anonymization 

Subnet-preserving 

Subnet-collapsing 

Host-preserving 

Host-collapsing 

Ports  &  Other  issues 

Conclusion 
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The  Balance  of  Anonymization 


Flow  itself  preserves  some  privacy 
by  aggregation  and  eliding  content. 

Anonymization  is  to  aid  in  preserving 
the  privacy  of  organizations 
represented  in  the  data 

•  Data  owner 

•  Partner  or  Customer 

•  Incidental 

•  Attacker 

The more you anonymize the data,
the fewer analyses can be done with
it.

Need  to  explore  a  range  of  options 


Privacy  Utility 
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Subnet  Preserving 


Preserve  host  identity  while 
concealing  network. 

How: 

•  Prepare  list  of  networks 

•  Assign  random  substitution  for 
network  prefix 

•  Mask  and  replace  prefix  on  each 
address 

•  Associative  array  works  well  for 
substitutions 

Balance: 

•  Enables  analysis  down  to  host 
identity,  but  not  organization 
identity 

•  Can  be  reversed  by  outside 
knowledge  (server  suffixes) 
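
The subnet-preserving steps listed above can be sketched as follows. This is a minimal illustration that assumes IPv4 and that all listed networks share one prefix length, so the random substitutes cannot overlap. The function names are hypothetical, and a Python `dict` plays the role of the associative array.

```python
import ipaddress
import random

def build_prefix_map(networks, rng):
    """Assign each listed network a distinct random prefix of the same length."""
    nets = [ipaddress.ip_network(n) for n in networks]
    mapping, used = {}, set()
    for net in nets:
        shift = 32 - net.prefixlen
        while True:
            candidate = rng.getrandbits(net.prefixlen) << shift
            if candidate not in used:
                break
        used.add(candidate)
        mapping[net] = candidate
    return mapping

def anonymize(addr, mapping):
    """Replace the network prefix and keep the host bits."""
    ip = ipaddress.ip_address(addr)
    for net, new_prefix in mapping.items():
        if ip in net:
            host_bits = int(ip) & int(net.hostmask)
            return str(ipaddress.ip_address(new_prefix | host_bits))
    return addr  # not in any listed network: left unchanged

if __name__ == "__main__":
    rng = random.Random(2008)  # fixed seed only to make the demo repeatable
    table = build_prefix_map(["240.204.0.0/16", "10.2.0.0/16"], rng)
    for a in ("240.204.5.3", "240.204.5.12", "10.2.3.9"):
        print(a, "->", anonymize(a, table))
```

Because the host part survives, well-known host suffixes can still give the original network away, which is the reversal risk noted in the balance above.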


Software  Engineering  Institute  Carnegie  Mellon 


Example: 240.204.5.3 → 010.005.5.3


Subnet  Collapsing 


Conceal  network  structure  and  host 
identity,  but  preserve  commonality  of 
network 

How: 

•  Reduce  all  address  to  the 
network 

•  Prepare  random  substitution  for 
network 

•  Replace  address  with  network 
substitutions 

Balance: 

•  Allows  network-level  behavior 
analysis 

•  Might  be  reversed  by 
organizations  with  lots  of  contact 
with  data  source 
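
Subnet collapsing can be sketched in the same spirit. The fixed prefix length and the function name are assumptions for illustration. Every address is first reduced to its network, and each network receives one random substitute.

```python
import ipaddress
import random

def collapse(addr, prefixlen, mapping, rng):
    """Reduce an IPv4 address to its network and return that network's substitute."""
    net = ipaddress.ip_network(f"{addr}/{prefixlen}", strict=False)
    if net not in mapping:
        shift = 32 - prefixlen
        taken = set(mapping.values())
        while True:
            candidate = ipaddress.ip_address(rng.getrandbits(prefixlen) << shift)
            if candidate not in taken:
                break
        mapping[net] = candidate
    return str(mapping[net])

if __name__ == "__main__":
    rng = random.Random(2008)  # fixed seed only to make the demo repeatable
    table = {}
    for a in ("240.204.5.3", "240.204.77.1", "10.2.3.9"):
        print(a, "->", collapse(a, 16, table, rng))
```

All hosts of one network map to the same value, so only network-level behavior remains analyzable.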
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240.204.0.0 


010.005.0.0 


Host  Preserving 

Preserve  host  identity  while 
concealing  network 
commonality 

How: 

♦  Generate  list  of  addresses 

♦  Generate  random  substitution 
for  each  address 

♦  Replace  each  occurrence  with 
same  substitution 

Balance: 

♦  Allows  host-specific  analysis 

♦  Difficult  to  reverse 
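
A host-preserving substitution might look like the sketch below (IPv4 only; the function name is hypothetical). Sampling without replacement guarantees that two different hosts never receive the same substitute, and the resulting table is applied to every occurrence of an address.

```python
import ipaddress
import random

def build_host_map(addresses, rng):
    """Map every distinct address to its own random substitute address."""
    unique = sorted(set(addresses))
    substitutes = rng.sample(range(2**32), len(unique))
    return {a: str(ipaddress.ip_address(s)) for a, s in zip(unique, substitutes)}

if __name__ == "__main__":
    seen = ["10.2.3.9", "240.204.5.12", "192.168.12.7", "10.2.3.9"]
    table = build_host_map(seen, random.Random(2008))
    print([table[a] for a in seen])  # first and last entries are identical
```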
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10.2.3.9 

240.204.5.12 

192.168.12.7 


Host  Randomizing 


Do  not  preserve  host  or 
network  identity  (a.k.a., 
remove  address  content  in 
any  useful  way) 

How: 


•  Replace  each  occurrence  of 
each  address  with  random 
value 

•  Allow  repetition  of  random 
values 

Balance: 

•  Only  permit  analysis  that  does 
not  involve  address 
information 

•  Extremely  difficult  to  reverse 




248.204.5.3  128.0.3.2 

10.2.3.7  192.168.7.12 
248.204.5.3  0.5.4.1 
192.168.17.37  10.2.3.7 
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Ports  and  Other  Issues 


There’s  more  to  anonymization  of  flow  than 
addresses 

•  Network  ports  can  be  very  revealing  (OS  fingerprinting) 

•  Timing  information  might  be  revealing 

•  TCP  flags  might  be  revealing  (odd  patterns) 


Can  anonymize  this  information: 

•  Ports:  reduce  to  service,  substitute;  reduce  to 
common/reserved/dynamic 

•  Timing:  restart  epoch;  rescale  timing;  collapse  interval 

•  TCP  flags:  reduce  to  function;  remove  OS-dependencies 
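
Two of the reductions above can be sketched briefly. The port classes here follow the IANA ranges (system, user, dynamic), which is one possible reading of "common/reserved/dynamic" and an assumption of this example. "Restart epoch" is read as shifting all timestamps so the earliest becomes zero.

```python
def port_class(port):
    """Reduce a port number to a coarse class (IANA ranges)."""
    if not 0 <= port <= 65535:
        raise ValueError("port out of range")
    if port < 1024:
        return "system"
    if port < 49152:
        return "user"
    return "dynamic"

def restart_epoch(timestamps):
    """Shift timestamps so that the earliest one becomes 0."""
    t0 = min(timestamps)
    return [t - t0 for t in timestamps]

if __name__ == "__main__":
    print([port_class(p) for p in (80, 8080, 50000)])
    print(restart_epoch([1163462400, 1163462405, 1163462430]))
```

Both transformations keep relative structure (service type, inter-arrival times) while hiding the exact values.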
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Conclusion 

Data  sharing  is  difficult 


Anonymization  can  be  useful,  but  limiting 
Anonymized  does  not  mean  private  or  irreversible 
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Using  the  Google  Maps  API 
for  Flow  Visualization 

Where  on  Earth  is  my  Data? 


Sid  Faber 

Network  Situational  Awareness  Group 
sfaber@cert.org 
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Agenda 

•  Step  1 :  Extracting  Flow  Data 

•  Step  2:  Geolocation 

•  Step  3:  Convert  to  XML 

•  Aside:  The  Google  Maps  API 

•  Step  4:  The  HTML  Page 

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Data  Used  for  Demo 

SC06  Data  Set 

•  November  14,  2006 

•  Goal  is  to  look  at  who  talked  to  whom 


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Extracting  Flow  Data 

What  story  do  you  want  to  tell  with  geolocation? 

•  Traffic  source  or  destination 

—  Data  record  =  one  value  per  address 

•  Relations  between  addresses 

—  Data  record  =  one  value  per  source,  destination  address  pair 


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Extracting  Flow  Data:  SiLK  Example 


Traffic  destination 

$ rwfilter \
    --start=2006/11/14 \
    --proto=0-255 \
    --class=all --pass=stdout \
  | rwuniq \
    --fields=dip --bytes > dst.txt
140.221.159.103  12568504471655 

172.30.5.11  11381325217792 

172.30.6.11  7397483692032 


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Step  1:  Summary 


Extract  Flow  Data 

•  Start  with  raw  flow  data 

•  End  with  summarized  flow  data  (2  columns) 

—  Destination  IP,  value 

—  Space  delimited 


For  Example: 

140.221.159.103 12568504471655
172.30.5.11 11381325217792
172.30.6.11 7397483692032
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Geolocating  by  Country 


Map IP to Country: IPligence, http://www.ipligence.com

"0000000000","0033554431","US","UNITED STATES","NA",...
"0033554432","0050331647","DE","GERMANY","EU","EUROPE"

Map Country to Lat/Long: MaxMind
http://www.maxmind.com/app/country

US,38.0000,-97.0000
DE,51.0000,9.0000

Combine: IP range to Lat/Long

0000000000 0033554431 US 38.0000 -97.0000
0033554432 0050331647 DE 51.0000 9.0000
0050331648 0067108863 HK 22.2500 114.1667

(The first two columns are the start and end of the range as numeric IPs.)
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Geolocating  by  Addresses 


DNS  LOC 

$ host -t LOC cmu.edu

cmu.edu LOC 40 26 39.000 N 79 56 36.200 W 283.00m

Caida Netgeo

$ wget 'http://netgeo.caida.org/perl/netgeo.cgi?target=128.2.10.162'

TARGET: 128.2.10.162<br>
NAME: CMU-NET<br>
NUMBER: 128.2.0.0 - 128.2.255.255<br>
LAT: 40.44<br>
LONG: -79.95<br>


Hostip.info,  http://www.hostip.info/dl/index.html 
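
A DNS LOC answer gives degrees, minutes, and seconds, whereas the mapping steps that follow need decimal coordinates. The sketch below converts the record data shown above. It assumes the full "d m s hemisphere" form for both latitude and longitude, and the function names are hypothetical.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value

def parse_loc(rdata):
    """Parse LOC record data such as '40 26 39.000 N 79 56 36.200 W 283.00m'."""
    f = rdata.split()
    lat = dms_to_decimal(int(f[0]), int(f[1]), float(f[2]), f[3])
    lng = dms_to_decimal(int(f[4]), int(f[5]), float(f[6]), f[7])
    return lat, lng

if __name__ == "__main__":
    lat, lng = parse_loc("40 26 39.000 N 79 56 36.200 W 283.00m")
    print(f"{lat:.4f} {lng:.4f}")
```

For the cmu.edu record this lands close to the coordinates that Netgeo reports for the same network.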


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Sample  Commercial  Data:  Quova 

 #  Field               Record 1        Record 2

 1  start_ip_int        50331648        67272896
 2  end_ip_int          50378239        [illegible]
 3  cidr                24              16
 4  continent           north america   north america
 5  country             united states   united states
 6  country_iso2        us              us
 7  country_cf          80              97
 8  region              northeast       northeast
 9  state               connecticut     massachusetts
10  state_cf            10              87
11  city                fairfield       woburn
12  city_cf             10              77
13  postal_code         06825           01888
14  phonenumberprefix   203             781
15  timezone            -5              -5
16  latitude            41.1753         42.4867
17  longitude           -73.2812        -71.1543

(start_ip_int and end_ip_int are numeric IPs.)
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Add  location  to  data  and  regroup 


Perl-fu  pseudocode: 

Read location data into a lookup table
For each line of data {
    Extract IP and [value]
    Find lat, long coordinates for IP
    Create a bin for the coordinates and add [value]
}
Print


12.178.6.55 

24.168.15.130 
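
The pseudocode above could be realized along the following lines. Python is used here instead of Perl. The tuple layout of the range table and the function names are assumptions for this sketch, and a binary search over non-overlapping ranges is one reasonable way to implement the lookup table.

```python
import bisect
import ipaddress
from collections import defaultdict

def make_lookup(ranges):
    """ranges: non-overlapping (start_int, end_int, lat, lng) tuples."""
    ranges = sorted(ranges)
    starts = [r[0] for r in ranges]

    def lookup(ip):
        n = int(ipaddress.ip_address(ip))
        i = bisect.bisect_right(starts, n) - 1
        if i >= 0 and n <= ranges[i][1]:
            return ranges[i][2], ranges[i][3]
        return None  # address not covered by the location data

    return lookup

def bin_by_location(lines, lookup):
    """Sum the value column of 'IP value' lines per coordinate pair."""
    bins = defaultdict(int)
    for line in lines:
        ip, value = line.split()
        coords = lookup(ip)
        if coords is not None:
            bins[coords] += int(value)
    return dict(bins)

if __name__ == "__main__":
    lookup = make_lookup([
        (0, 33554431, 38.0, -97.0),
        (33554432, 50331647, 51.0, 9.0),
    ])
    data = ["1.2.3.4 10", "1.9.9.9 5", "2.0.0.1 7"]
    for (lat, lng), total in bin_by_location(data, lookup).items():
        print(lat, lng, total)
```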
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Geolocating  with  SiLK  pmaps 


Prefix  maps  associate  a  value  with  an  IP  address 
prefix 


•  Text  based  pmap: 


#Start-IP  End-IP     CC Lat     Long
0033554432 0050331647 DE 51.0000 9.0000
0050331648 0067108863 HK 22.2500 114.1667

(The last three columns form the pmap value.)
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Building  the  Geolocation  pmap 

Some  perl-fu: 


read countrylatlng.txt into a hash
foreach line in the ipligence data set {
    look up the countrylatlng.txt line for
    the code
    print out the ip range, country code and
    coordinates
}



-  See  make-geo-cc-pmap.pl  in  the  sample 
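
As a rough counterpart to the pseudocode above (this is not the make-geo-cc-pmap.pl script itself), the join can be written as follows in Python. The function name and the use of in-memory strings instead of files are choices made for this sketch. The column order follows the sample lines shown for the two data sets.

```python
import csv
import io

def combine(ip_country_csv, country_latlng_csv):
    """Join IP-range/country rows with country/lat/long rows into pmap text lines."""
    latlng = {}
    for cc, lat, lng in csv.reader(io.StringIO(country_latlng_csv)):
        latlng[cc] = (lat, lng)
    lines = []
    for row in csv.reader(io.StringIO(ip_country_csv)):
        start, end, cc = row[0], row[1], row[2]
        if cc in latlng:  # skip countries without known coordinates
            lines.append(" ".join([start, end, cc, *latlng[cc]]))
    return lines

if __name__ == "__main__":
    ip_rows = (
        '"0000000000","0033554431","US","UNITED STATES"\n'
        '"0033554432","0050331647","DE","GERMANY"\n'
    )
    coords = "US,38.0000,-97.0000\nDE,51.0000,9.0000\n"
    print("\n".join(combine(ip_rows, coords)))
```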
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Using  the  Geolocation  pmap 


Use  the  pmap  with  rwuniq: 

$ rwfilter \
    --start=2006/11/14 \
    --proto=0-255 \
    --class=all --pass=stdout \
  | rwuniq \
    --pmap-file=geo-cc.pmap \
    --fields=dval --bytes --delimited=" " --no-titles \
  > geo-dst.txt

US 38.0000 -97.0000 102372319236580
JP 36.0000 138.0000 9965004709495
CA 60.0000 -95.0000 569989239278
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Step  2:  Summary 

Geolocate  Flow  Data 

•  Start  with  summarized  flow  data 

•  End  with  location  data  (4  columns) 

—  Destination  label,  latitude,  longitude,  value 
—  Space  delimited 

—  SiLK  pmaps  combine  steps  1  and  2 

•  For  example: 

US 38.0000 -97.0000 102372319236580
JP 36.0000 138.0000 9965004709495
CA 60.0000 -95.0000 569989239278
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XML  Data 

Convert  to  XML 

•  The  GoogleMaps  routine  we’ll  be  using  takes  XML  input 

•  We  define  the  schema 

•  We’ll  process  Step  2  data  with  a  simple  awk  command 

$ cat geo-dst.txt | \
    awk 'BEGIN { print "<markers>" }
         { printf "<marker lbl=\"%s\" lat=\"%s\" lng=\"%s\" val=\"%s\" />\n", $1, $2, $3, $4 }
         END { print "</markers>" }' \
    > geo-dst.xml
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Step  3:  Summary 


Convert  to  XML 

•  Start  with  labels,  coordinates  and  values 

•  End  with  XML  document  with  the  same  data 


•  For  example: 


<markers>
<marker lbl="CN" lat="35.0000" lng="105.0000" val="704206"/>
<marker lbl="MR" lat="20.0000" lng="-12.0000" val="200"/>
<marker lbl="KN" lat="17.3333" lng="-62.7500" val="646"/>
</markers>
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Google  Maps  Widgets 


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Google  Maps  API  Fundamentals 

http://code.google.com/apis/maps/documentation/

•  Very  well  documented,  lots  of  examples 

•  Start  simple  (like  this  demo) 

•  Requires  very  basic  javascript  and  HTML  knowledge 

General  flow: 

•  Include  the  source  code 

•  Create  the  map 

•  Drop  markers  onto  the  map 
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About  keys  and  data 

In  order  to  include  the  library  source,  you  need  a  key 

•  The  key  uniquely  identifies  your  URL 

•  Not  necessary  when  serving  via  a  file://  URL 

Doesn’t  the  data  get  posted  up  to  Google? 

•  No, Google only sees your requests for the underlying
map images

•  All  marker  placement  and  labeling  is  done  local  to  the 
client  with  overlays 
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geo-dst.html  (part  1) 


<html><head><title>IP Geolocation Example</title>
<script src="http://maps.google.com/maps?file=api&amp;v=2&amp;key="
        type="text/javascript"></script>
<script type="text/javascript">
// This is the file that contains the point data
var map;
var xmlFile = "geo-dst.xml";

// Called when the map is loaded. This function
// creates the map, adds controls to it, and then
// the points are laid on top of the map
function load() {
  if (GBrowserIsCompatible()) {
    map = new GMap2(document.getElementById("map"));
    map.addControl(new GLargeMapControl());
    map.addControl(new GOverviewMapControl());
    map.addControl(new GMapTypeControl());
    map.setCenter(new GLatLng(38, -97), 1);
    loadpoints();
  }
}


© 2007 Carnegie Mellon University


geo-dst.html  (part  2) 


// http://code.google.com/apis/maps/documentation/services.html#XML_Requests
function loadpoints() {
  GDownloadUrl(xmlFile, function(data, responseCode) {
    var xml = GXml.parse(data);
    var markers = xml.documentElement.getElementsByTagName("marker");
    for (var i = 0; i < markers.length; i++) {
      var point = new GLatLng(parseFloat(markers[i].getAttribute("lat")),
                              parseFloat(markers[i].getAttribute("lng")));
      var descr = markers[i].getAttribute("lbl") + "; " + markers[i].getAttribute("val");
      map.addOverlay(new GMarker(point, {title: descr, clickable: false}));
    }
  });
}
</script></head>
<body onload="load()" onunload="GUnload()"><h2>IP Geolocation Example</h2>
<div id="map" style="width: 640px; height: 480px"></div>
</body>
</html>
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The  Results... 


IP  Geolocation  Example 
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Customizing  Marker  Icons 


Two  modifications  needed 

•  Define  the  different  icons  upon  initialization 

•  Choose  the  icon  when  points  are  added 


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geo-dst-v2.html  (part  1) 


function load() {
  if (GBrowserIsCompatible()) {
    map = new GMap2(document.getElementById("map"));
    map.addControl(new GLargeMapControl());
    map.addControl(new GOverviewMapControl());
    map.addControl(new GMapTypeControl());
    map.setCenter(new GLatLng(38, -97), 1);

    // create different pins
    sredicon.image = "green-s.png";
    sredicon.shadow = "shadow-s.png";
    sredicon.iconSize = new GSize(8, 13);
    sredicon.shadowSize = new GSize(14, 13);
    sredicon.iconAnchor = new GPoint(4, 12);
    sredicon.infoWindowAnchor = new GPoint(5, 1);

    mredicon.image = "red-m.png";
    mredicon.shadow = "shadow-m.png";
    mredicon.iconSize = new GSize(12, 20);

    loadpoints();
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geo-dst-v2.html  (part  2) 


// http://code.google.com/apis/maps/documentation/services.html#XML_Requests
function loadpoints() {
  GDownloadUrl(xmlFile, function(data, responseCode) {
    var xml = GXml.parse(data);
    var markers = xml.documentElement.getElementsByTagName("marker");
    for (var i = 0; i < markers.length; i++) {
      var point = new GLatLng(parseFloat(markers[i].getAttribute("lat")),
                              parseFloat(markers[i].getAttribute("lng")));

    }

      var ratio = Math.log(parseFloat(markers[i].getAttribute("val")) /
                           minval) / Math.log(maxval / minval);
      //
      // Plot the pin corresponding to the logarithmic ratio
      if (ratio < 0.2) {
        map.addOverlay(new GMarker(pointlist[i], {icon: sredicon, title: de...
      } else if (ratio < 0.9) {
        map.addOverlay(new GMarker(pointlist[i], {icon: mredicon, title: de...
      } else {
        map.addOverlay(new GMarker(pointlist[i], {icon: lredicon, title: de...
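
The logarithmic ratio used to choose a pin can be checked outside the browser. The sketch below restates the formula ratio = ln(val / minval) / ln(maxval / minval) and the 0.2 and 0.9 thresholds from the slide in Python. The function names are hypothetical, and it assumes 0 < minval < maxval.

```python
import math

def log_ratio(val, minval, maxval):
    """Position of val between minval and maxval on a logarithmic scale (0 to 1)."""
    return math.log(val / minval) / math.log(maxval / minval)

def pick_icon(ratio):
    """Same thresholds as the JavaScript: small, medium, or large pin."""
    if ratio < 0.2:
        return "sredicon"
    elif ratio < 0.9:
        return "mredicon"
    return "lredicon"

if __name__ == "__main__":
    values = [704206, 200, 646]  # val attributes from the sample XML
    lo, hi = min(values), max(values)
    for v in values:
        r = log_ratio(v, lo, hi)
        print(v, round(r, 3), pick_icon(r))
```

The logarithm keeps a few very large byte counts from pushing every other marker into the smallest class.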
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The  Results... 


IP  Geolocation  Example 
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Adding  Links 


Need  a  new  data  set 

•  Create  an  XML  file  with  source  location,  destination 
location  and  value 

•  Add  a  new  function  to  read  and  plot  the  data  file 




geo-dst-v3.html 


function loadlinks() {
  GDownloadUrl(xmlFile, function(data, responseCode) {

    var slink = new GLatLng(parseFloat(links[i].getAttribute("slat")),
                            parseFloat(links[i].getAttribute("slng")));
    var elink = new GLatLng(parseFloat(links[i].getAttribute("elat")),
                            parseFloat(links[i].getAttribute("elng")));
    map.addOverlay(new GPolyline([slink, elink],
                                 "#000000",          // Color
                                 ratio * 5,          // Thickness
                                 ratio / 2,          // Opacity
                                 {geodesic: true}));
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The  Results... 


IP  Geolocation  Example 
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Where  to  go  from  here 

Make  it  your  own 

•  Generate  info  window  popups 

•  Drag  markers 

•  Add  driving  directions 

See http://code.google.com/apis/maps/

Download  sample  code  from  the  training  server 
(128.2.243.104)  in  /home/sfaber/presentation 
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Using  the  GoogleMaps  API 
for  Flow  Visualization 

Where  on  earth  is  my  data? 


Sid  Faber 

Network  Situational  Awareness  Group 
sfaber@cert.org 

Download  sample  code  from  the  training  server  (128.2.243.104) 
in  directory  / home/sfaber/presentation 
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Flow  Analysis  in  a  Wireless 
Environment  with  short  DHCP 

Leases 


Sanket  Parikh 
John  McHugh 

Dept,  of  Computer  Science 
Dalhousie  University 


FlowCon  2008 


DALHOUSIE 

UNIVERSITY 


Inspiring  Minds 


Project  Objectives 


Analysis of wireless network data from Dartmouth College
(CRAWDAD archive)

Adding MAC layer information to NetFlow tools to identify
nodes and the activities performed by a node.

Return  converted  flow  data  to  the  Crawdad  archive. 
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Project  Rationale 


The  main  issue  in  analyzing  wireless  network  data  from 
many  environments  is  the  assignment  of  temporary  IP 
Addresses  using  DHCP  with  short  leases. 

The total user population often exceeds the available address
space, and a given user may connect to the network for short
sessions from a number of different locations, complicating
per-platform analyses.

Work  to  date  has  concentrated  on  mobility  rather  than 
platform  behaviour. 
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The  Data 


160  GB  of  compressed  tcpdump  packet  headers. 

Collected  continuously  from  2  Nov  04  -  28  Feb  04 

18 collection points: academic, library, residence

Nothing beyond IP headers except TCP ports and flags, UDP ports.

Anonymized  with  prefix  preserving  technique 

-  Usage  agreement  precludes  attacking  anonymization  to  determine 
user  identity. 

-  Low  order  24  bits  of  MAC  also  anonymized 

•  List  of  known  wireless  MAC  addresses  provided 

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Technical  Approach  -  1 


Tried  to  use  vlan  tag  fields  to  avoid  altering  YAF 
record  format. 

Use the forward and reverse vlan tag fields to get
source and destination MAC addresses into the yafscii

Since these fields are 16 bits, use a perfect hash of the MAC

Problems: 

-  vlan  tag  is  in  unidirectional  extension  of  flow.  Need 
both,  even  for  unidirectional  flows. 

-  would  like  to  use  with  real  time  and  when  MAC  set 
not  completely  known 
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Technical  Approach 

We  added  MAC  to  the  bidirectional  flow  root  in  yaf,  with 
both  source  and  destination  MAC  addresses. 

There  are  a  number  of  subtleties  here,  including  the  use  of 
memcopy  that  introduces  field  order  dependencies  (an 
IPv4  optimization)  and  the  assumption  that  MAC  flag 
implies  vlanid  not  zero. 

Once the MAC addresses were in the yafscii output, we
started converting it into SiLK format for further data analysis.

Shortly  after  we  finished,  CERT  added  MAC  address 
support  to  YAF  and  we  will  use  it  in  the  future. 




Technical  Approach 


We created a module, yafscii2tuc.c

-  Inserts minimal perfect hash index of MAC in the in / out fields

-  Adds sensor id from command line to identify the sniffers.

We split the output of yafscii2tuc into separate hourly
streams and use popen to send each one to a separate
invocation of rwtuc so that the resulting files are in a
proper date hierarchy.

We  also  use  rwsort  on  the  rwtuc  output  to  ensure  time 
order  and  because  rwtuc  does  not  compress. 
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Minimal  perfect  hashes 


A  Minimal  Perfect  Hash  maps  a  set  of  N  unique  strings  into 
integers  in  [0...N-1] 

-  Packages  available  on  internet  designed  for  null  terminated  strings 

-  Modified  for  counted  strings 

-  Extracted  all  MACS  from  Dartmouth  packet  data 

-  Grouped  to  bring  common  usages  together,  e.g.  known  wireless, 
gateways,  etc.  then  created  MPH 

-  17,000+ MACs, 11,000+ with IP packets.

Lookup  is  constant  time,  collision  free 
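
To make the definition concrete, here is a toy minimal perfect hash over counted (fixed-length, possibly NUL-containing) byte strings such as MAC addresses. It simply searches for a seed under which a seeded hash maps the N keys onto 0..N-1 without collision. This brute-force search is an illustrative assumption that is practical only for a handful of keys; it is not the package used in the project, and the sample MACs are invented.

```python
import hashlib

def _slot(seed, key, n):
    """Seeded hash of a byte string, reduced to the range [0, n)."""
    digest = hashlib.sha256(seed.to_bytes(4, "big") + key).digest()
    return int.from_bytes(digest[:8], "big") % n

def build_mph(keys, max_seed=1_000_000):
    """Return a seed for which the keys map one-to-one onto 0..len(keys)-1."""
    keys = list(keys)
    n = len(keys)
    for seed in range(max_seed):
        if len({_slot(seed, k, n) for k in keys}) == n:
            return seed
    raise ValueError("no suitable seed found")

def mph_index(seed, key, n):
    """Constant-time, collision-free lookup for keys in the original set."""
    return _slot(seed, key, n)

if __name__ == "__main__":
    macs = [bytes.fromhex(h) for h in (
        "001122334455", "0011223344aa", "00a0c9000001",
        "00a0c9000002", "ffffffffffff", "01005e000001",
    )]
    seed = build_mph(macs)
    print(sorted(mph_index(seed, m, len(macs)) for m in macs))
```

For a set of 17,000+ keys a dedicated construction algorithm is needed, since the chance that a random seed is collision-free shrinks roughly as N!/N^N.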
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Remaining  problems 


yaf  does  not  deal  with  decreasing  time  well 

-  In  live  capture,  packets  are  always  in  increasing  time  order  no 
matter  what  the  clock  says 

-  In  playback  the  same  holds  unless  the  file  has  been  reordered. 

-  Several  Dartmouth  sensors  exhibit  decreasing  time,  probably  due 
to  ntp  or  other  clock  adjustments. 

Data  from  one  of  the  sensors  “breaks”  the  pipe 

-  This  may  be  related  to  the  time  problem  above  or  may  be  due  to 
another  problem 

-  Truncated  packets  may  lead  to  other  pathologies  in  yaf 
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Next  steps 


We  want  to  reassign  the  IPs  currently  used  to  a  consistent  IP 
that  is  related  to  the  MAC  index. 

First  we  need  to  determine  if  any  wireless  IPs  are  associated 
with  gateway  MACs. 

-  This  would  occur  if  a  wireless  unit  talked  to  another  wireless  unit 
via  a  routed  connection,  e.g.  units  connecting  via  separate  sniffers. 

-  Start  by  creating  sets  for  each  MAC  type  and  looking  for 
intersections 

-  May  have  to  explore  DHCP  strategy  in  more  detail. 

This  is  currently  underway. 




MAC  types 

There  are  5  categories  of  MACS  actively  involved 

-  Known  Wireless  MACs  with  IP  traffic 

-  Other  MACs  with  IP  packets 

-  Multi  cast  MACs 

-  Gateway  MACs 

-  Broadcast  MACs 

A  large  number  of  MACs  have  no  IP  traffic 

•  Some  appear  only  at  link  layer,  others  in  MAC  list  but  not 
seen 

We  used  rwfilter  to  build  sets  for  each  type  of  MAC  address 
based  on  the  input  and  output  field  values 
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Project  Outcomes 


We found some interesting information during analysis of the datasets. Some traces show the same IP addresses appearing at two different sniffers placed in different locations.

The physical placement of the sniffers may be the reason: if the sniffers were not placed far enough apart, the same IP traffic could be captured by both.

This seems improbable and needs further study.
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Next  Steps 


The technique we used for this research should prove useful for similar data from wireless “hot spots”, airports, hotels, convention center networks, and more.

The same approach can be used to analyze data by using MAC-layer information in flow analysis tools to identify the activities and movements of nodes in wireless networks.
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Flow  Visualization  Using  MS-Excel 


Visualization  for  the  Common  Man 
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Background 


US-CERT  Mission 


Einstein  Program 

>  Large  volumes  of  traffic 

>  Architecture  limitations 


Proactive  vs.  Reactive  analysis 

Slow  application  certification 
process 
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Pro’s  and  Con’s 


Pro’s: 

-  Visualization  allows  for  rapid  analysis 

-  Patterns  are  easy  to  identify 

-  Flexibility  in  analysis 

-  Most  enterprises  have  MS  Office  (Excel) 

Con’s: 

-  Excel  plotting  engine  is  limited 

-  Max  of  65K  records  (recommend  <=  50K) 

-  Data  must  be  imported  and  formatted 

-  Memory  management  is  an  issue 
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Data  Preparation  Steps 


Data  Pull 


Data  Reduction 


Importing  Data 


Data  Formatting 


Sample  analysis  slides 
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Data  Pull 


Analysts  have  several  options  when  trying  to  pull 
interesting  datasets.  Several  methods  we  find 
useful  are: 


Collecting data during non-business hours

-  Reduces traffic from users; helps expose automated sessions

Searching for outbound traffic only

-  Reduces noise from scanning, etc.

Filtering for packets with the PSH/ACK flags set in the initial flags field

-  Focuses the traffic on sessions where data is actually transferred

Filtering for packets with the SYN flag set in the initial flags field

-  Focuses on sessions initiated by your organization

Limiting traffic to records under 5K bytes

-  Most cyclical sessions (beaconing) happen in this range


Traffic  should  be  refined  to  provide  the  best  possible  dataset  for  analysts  to 
work  with. 
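
A few of these pull criteria can be combined in a short filter. The sketch below is a Python illustration only, assuming each flow record is a dictionary; the field names (`type`, `initial_flags`, `bytes`, `sip`) and the 5000-byte reading of "5K" are assumptions, not the Einstein schema.

```python
def pull(records, direction="out", required_initial_flags="S", max_bytes=5000):
    """Keep flows in the given direction whose initial flags include all of
    the required flags and whose byte count is under max_bytes."""
    selected = []
    for record in records:
        if record["type"] != direction:
            continue
        if not set(required_initial_flags) <= set(record["initial_flags"]):
            continue
        if record["bytes"] >= max_bytes:
            continue
        selected.append(record)
    return selected


# Hypothetical flow records.
records = [
    {"sip": "10.0.0.5", "bytes": 620, "type": "out", "initial_flags": "S"},
    {"sip": "10.0.0.6", "bytes": 48000, "type": "out", "initial_flags": "S"},
    {"sip": "192.0.2.9", "bytes": 300, "type": "in", "initial_flags": "S"},
    {"sip": "10.0.0.7", "bytes": 900, "type": "out", "initial_flags": "PA"},
]
print([r["sip"] for r in pull(records)])
print([r["sip"] for r in pull(records, required_initial_flags="PA")])
```

Passing `required_initial_flags="PA"` switches from sessions initiated by the organization to flows where data is actually transferred.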
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Data  Reduction 



To  further  enhance  the  concentration  of  suspicious  data, 
analysts  should: 


Remove  replies  from  servers  (responses  to  inbound  server  requests) 

-  Looking  for  genuine  outbound  traffic 

Remove loud, common talkers (instant messenger, web crawlers, etc.)

-  Reduces the noise, especially in web traffic

“Whitelists” and “blacklists” are helpful for filtering


This  is  an  iterative  approach  -  Analyze,  Research,  Remove. 
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Importing  Data 


Data is imported from a pipe-delimited text file

[Screenshot: sample pipe-delimited flow records from 2007/10/29 (protocol 6 flows to or from port 80), one record per line, with the following fields.]

sIP | dIP | sPort | dPort | pro | packets | bytes | flags | sTime | dur | eTime
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Data  Formatting 


Columns within the spreadsheet should be aligned to the fields of the flow records. Einstein data is formatted to include:

-  Source IP
-  Destination IP
-  Source Port
-  Destination Port
-  Protocol
-  Packets
-  Bytes
-  Flags
-  Start Time
-  Duration
-  End Time
-  Sensor
-  Type
-  Initial Flags


[Screenshot: Excel worksheet with AutoFilter enabled on columns A through N (sIP, dIP, sPort, dPort, pro, packets, bytes, flags, sTime, dur, eTime, sensor, type, initialFlags), showing anonymized outbound flow records to destination port 22 from 2007/10/17.]

Data  Formatting  Cont 
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US-CERT  analysts  use  two  methods  to  format  the  Einstein  time 
fields  into  a  format  that  is  able  to  be  plotted: 

A:  Use the --legacy-timestamps switch to output the time in MM/DD/YYYY HH:MM:SS format instead of the default YYYY/MM/DDTHH:MM:SS.mmm

B:  Use the Replace function in Excel to remove the milliseconds from the time and replace the T placeholder with a space:


[Screenshots: worksheet with the time columns selected, and two Excel “Find and Replace” dialogs. First pass: Find what “.???”, Replace with left empty (removes the milliseconds). Second pass: Find what “T”, Replace with a single space.]
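
The same time-field cleanup can be done before the data reaches Excel. The following is a minimal Python sketch, assuming timestamps in the YYYY/MM/DDTHH:MM:SS.mmm layout seen in the sample records; the function name and the `legacy` parameter are illustrative.

```python
from datetime import datetime


def reformat_stamp(stamp, legacy=False):
    """Drop the milliseconds and the 'T' separator from a flow timestamp.

    With legacy=True the date is written as MM/DD/YYYY instead.
    """
    parsed = datetime.strptime(stamp, "%Y/%m/%dT%H:%M:%S.%f")
    if legacy:
        return parsed.strftime("%m/%d/%Y %H:%M:%S")
    return parsed.strftime("%Y/%m/%d %H:%M:%S")


print(reformat_stamp("2007/10/17T00:10:40.722"))
print(reformat_stamp("2007/10/17T00:10:40.722", legacy=True))
```

The default output mirrors the two find-and-replace passes (method B); `legacy=True` mirrors the layout produced by method A.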
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Plot 


Analysis  Workflow 


Zoom 


Highlight 


AutoFilter 
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Series  “sPort"  Point  "11/19/2007  5:07"  I 

[Screenshot: scatter plot of sPort against sTime with the point (11/19/2007 5:07, 63776) highlighted, above a worksheet filtered with the Custom AutoFilter dialog on sPort. The filtered rows show flows from source port 63776 to destination port 443 (protocol 6, flags PA) recurring every few minutes between 5:00 and 6:30 on 11/19/2007, each with 1 to 10 packets and 62 to 851 bytes.]
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Zoom 
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You  can  “zoom”  in  to  specific  data  points,  by 
changing  the  scale  of  the  axis 

•  Right  click  on  the  axis 

•  Select  “Format  Axis” 

•  Click  on  the  “Scale”  tab 

•  Adjust  scale  as  desired 

•  Works for both axes

•  Remember  to  remove 
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Format  Axis. 
Clear 
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Highlight 


By  hovering  over  a  data  point  in  the  series  an 
analyst  can  locate  the  point  in  the  rest  of  the 
records  by  filtering  for  the  displayed  information 
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Series  "SrcIP"  Point  "1 L/J  8/2007  8:35" 
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Method  A  -  Drop  down  list: 
Select  the  desired  value  from  the 
drop  down  list 


Method  B  -  Custom  Filter: 

Select data by using Excel’s built-in Boolean logic search functions
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Study  Conclusion 


After  notifying  the  agency  in  question,  the 
machines  that  were  generating  this  traffic 
were  found  and  forensically  examined.  The 
malware  turned  out  to  be  a  keystroke  logger 
that  posted  data  to  a  specific  website  and 
retrieved  commands  embedded  on  the  same 
site.  Prior  to  this  incident,  there  was  no 
malware  associated  with  this  site. 
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Determining  application  patterns 

-  Identifying  specific  applications 


Working  with  gateway  traffic 

-  Structured  gateway 

-  Proxy  gateway 

-  Gateway  mannerisms 
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Contact  Info 




Technical  comments  or  questions 

-  US-CERT  Security  Operations  Center 

-  Email:  soc@us-cert.gov 

-  Phone:  +1  888-282-0870 

Media  inquiries 

-  US-CERT  Public  Affairs 

-  Email:  media@us-cert.gov 

-  Phone: +1 202-282-8010

General  questions  or  suggestions 

-  US-CERT  Information  Request 

-  Email:  info@us-cert.gov 

-  Phone: +1 703-235-5111

For  more  information,  visit  http://www.us-cert.gov 
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Overview 


■  Introduction  to  Bloom  Filters 

■  Overview of CIAC’s Bloom Filter-Based Indexing System

■  Approach's  Applicability  for  CIAC  &  other  CERTs 

■  Performance  on  Actual  Flow  Data 

■  Applications  of  Approach  in  Conjunction  With  Analytical 
Tools 

•  Facilitating  incident  detection  and  analysis  with  flow  visualization 
tools. 
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A  Very  Brief  Introduction  to  Bloom  Filters 
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Introduction  to  Bloom  Filters 


High-level  Functionality  -  trivial 


► 


Answer: 

“Yes” 


► 


Answer: 

“No” 


http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf

http://en.wikipedia.org/wiki/Bloom_filter 
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Introduction  to  Bloom  Filters 


■  The  Concept 

•  Efficient, probabilistic data structure, providing extremely lightweight string lookups, or “approximate membership queries”.

•  Invented by Burton Bloom in 1970 to optimize spellchecking.

•  Trades a small probability of false positives for massive gains in space and time efficiency.

•  Popular  for  various  large-scale  network  applications  (e.g.,  web 
caches,  query  routing). 


References: 

http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf

http://en.wikipedia.org/wiki/Bloom_filter 
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How  Bloom  Filters  Work 


1.  An empty Bloom filter is a bit array of m bits, all set to ‘0’.


2.  Introduce k different hash functions, each of which maps a key value to one of the m array positions.


3.  Insert an element by feeding it to each hash function to obtain k array positions. Set these bits to ‘1’.


4.  Query an element (check its existence) by feeding it to each hash function again and checking the corresponding bit positions. If all bits are ‘1’, then the element is either in the filter or it is a false positive.


5.  If any bit position among the hashes of an element contains a ‘0’, then that element is definitely not in the filter (no false negatives).
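
The five steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used by any particular tool: the k hash functions are simulated by salting SHA-256 with the hash index, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                            # number of bits in the array
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # step 1: all bits start at 0

    def _positions(self, item):
        # Step 2: derive k array positions by salting one hash with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Step 3: set the k bits for this element.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Steps 4 and 5: all bits set means "possibly present";
        # any zero bit means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=3)
bf.add("10.1.2.3")
print("10.1.2.3" in bf)   # an inserted element is always reported present
print("10.9.9.9" in bf)   # usually absent, but may be a false positive
```

Salting a single hash function is a convenience for the sketch; a production filter would typically use faster non-cryptographic hashes.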
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Introduction  to  Bloom  Filters 


■  False  Positives 

•  Probability of a false positive for a populated Bloom filter is approximately:

$$p(FP) \approx \left(1 - e^{-kn/m}\right)^{k}$$

[Chart: probability of false positive plotted against m/n (filter bits per element).]


k  -  number of hash functions used
n  -  number of elements inserted
m  -  size of Bloom filter (bit array)
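
The relationship between k, n, m and the false positive rate can be explored numerically. The sketch below uses the standard approximation p ≈ (1 − e^(−kn/m))^k and the matching sizing rule for an optimally chosen k; the function names are illustrative.

```python
import math


def false_positive_rate(k, n, m):
    """Approximate false positive probability for k hash functions,
    n inserted elements and an m-bit filter."""
    return (1.0 - math.exp(-k * n / m)) ** k


def bits_per_element(p):
    """Bits per element (m/n) needed for target rate p, assuming the
    number of hash functions is chosen optimally, k = (m/n) * ln 2."""
    return -math.log(p) / (math.log(2) ** 2)


print(false_positive_rate(k=7, n=1000, m=10000))
print(bits_per_element(0.01))
```

Growing the filter (larger m for the same n) lowers the rate, which is the tuning knob the summary slide refers to.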
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Bloom  Filters  -  Summary 


•  Quick  test  of  element  membership: 

•  0  likelihood  of  false  negatives 

•  Tunable  false  positive  rates 

•  Probability  of  collisions  proportional 
to  the  number  of  elements  in  set  & 
inversely  proportional  to  filter  size. 

•  Enforce  maximum  false  positive 
threshold  by  tuning  filter  size: 

•  Often  require  as  little  as  one  byte  per 
element 


Functionality 
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Significant  space  and  time 
advantages  over  many  standard, 
deterministic  indexing  structures: 

•  Self-balancing  trees 

•  Tries 

•  Hash-Tables 

•  Arrays,  Linked  Lists 

Query  time  is  O(k),  independent  of  number 
of  items  in  set. 


Many  open  source  implementations 
available. 


Inexpensive,  easy  to  deploy  and  maintain 
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Bloom  Filters: 

Operational  Viability  for  CIAC  and 
the  CERT  Community 
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ClAC’s  Flow  Collection  Review 


CIAC  collects  massive  volumes  of  biflow  data  from  29  sensors  across 
the  DOE  complex: 

•  300-500  million  biflows  daily  (~4600/s) 

•  ~14GB/94GB compressed/uncompressed daily


[Chart: approximate daily averages by sensor, showing the average, minimum, and maximum number of records per day.]
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ClAC’s  Flow  Collection  Review 


■  biflow  feed: 

•  Session  summary 

•  Fields: 

-  Date/Time  &  Duration 

-  Source/Destination  IP  and  Port 

-  Protocol  Information 

-  Bidirectional  Byte  and  Packet  Counts 

-  Bidirectional  Protocol  Options 

-  Subset  of  TCP/ICMP  flags 


Example  Biflow  Record 


1171066191.997532, 20070210000951.997532, site3,Ho30, 6, 192168081021, 192, 168, 81, 21, IT, 010000001008, 10, 0,1, 8, US, 53, 1024, 0,0, 0.0000, 0,0, 54, 0,1, 0,0, 0,0, 0,0, 60, 0,60,0, ,,14,00, +14, 0,0, 0,0 
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CIAC  Analysis  -  Legacy  Search  Methodologies 


■  File  grep 

•  Search sensors and hours for range of interest (e.g., “site3, site12, site21 from 10/1/06 through 12/31/06”).

•  Requires  reading/decompressing  and  combing  through 
GBs  of  data  (from  disk)  for  every  day  searched. 

■  RDBMS  -  Oracle 

•  SQL+ 

•  Perl/JDBC 

•  Typically limited* to past ~25 days of bi-directional sessions (~15%)

■  AWARE  web  portal 

•  High-level  charting  and  statistics  (session  counts,  etc.) 


Biflow 

DB 


Many  mission-critical  searches  can  take  several  hours  or  days  to  complete 

Lawrence  Livermore  National  Laboratory  - 
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Data  Volume 


Current  CIAC  Analysis  Data  Flow 


Analysis  Technique  Data  stores 


Analysis  tools 
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Watch  and  Warn  Query  Needs  and  Issues 


■  Rapidly  search  all  flow  data  over  long  periods  of  time: 

•  Analysts  typically  search  on  IP  address: 

Watch  list  (suspicious,  known-bad,  etc.) 

Nodes  of  interest 

Compromised  internal  nodes 

•  Various  time  (hours,  days,  months)  and  space  (single  site,  all  sites) 
scales. 

•  Require  quick  turnaround  (minutes)  to  respond  to  site  requests: 

e.g.  “Have  you  seen  these  IPs  at  my  site  in  the  past  3  weeks?” 

■  IP-based  searches  often  yield  relatively  small  result  sets: 

•  “Interesting” IP might only have been seen in 30 site-hours, whereas 21,600 hours (~1 DOE-month) might have been searched.

99.9%  wasted  duty  cycle! 

•  Need  to  reduce  the  search  space  (raw  flow  files)  through  better  cataloging  of 
data  as  it  arrives. 
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Bloomdex : 

CIAC’s Bloom Filter-based Indexing System
for Network Flow Analysis
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Solution:  Bloomdex 


■  Bloomdex 

•  A hybrid hierarchy/file-based Bloom filter system to index CIAC’s biflow records.

•  Currently indexed by source or destination IP.

•  Index partitioned by:

-  Site-month (e.g., “SITE8 11/2006”)

-  Site-day (e.g., “SITE8 11/5/2006”)

-  Site-hour (e.g., “SITE8 11/5/2006 13:00”)

•  Uses intuitive directory tree structures and multi-scale Bloom filters to accelerate IP-based searches.

•  max(FP rate) « 2×10^-4 -> 3 bytes of storage per unique IP
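
The multi-scale idea can be sketched as a month, day, hour hierarchy in which a miss at a coarse level skips everything beneath it. In the Python sketch below, plain sets stand in for the Bloom filters, and the class name, method names and key layout are assumptions made for illustration; this is not Bloomdex's actual code or directory layout.

```python
from collections import defaultdict


class HierarchicalIndex:
    """Per-site membership index at month, day and hour granularity.

    Plain Python sets stand in for the Bloom filters here."""

    def __init__(self):
        self.month = defaultdict(set)   # (site, "YYYY-MM") -> IPs
        self.day = defaultdict(set)     # (site, "YYYY-MM-DD") -> IPs
        self.hour = defaultdict(set)    # (site, "YYYY-MM-DD", hour) -> IPs

    def add(self, site, date, hour, ip):
        self.month[(site, date[:7])].add(ip)
        self.day[(site, date)].add(ip)
        self.hour[(site, date, hour)].add(ip)

    def site_hours(self, site, month, ip):
        """Return the (date, hour) pairs in one site-month that may hold ip."""
        if ip not in self.month.get((site, month), ()):
            return []   # the whole site-month is skipped with one lookup
        hits = []
        for (s, date), ips in sorted(self.day.items()):
            if s != site or not date.startswith(month) or ip not in ips:
                continue   # the whole site-day is skipped
            for h in range(24):
                if ip in self.hour.get((site, date, h), ()):
                    hits.append((date, h))
        return hits


idx = HierarchicalIndex()
idx.add("SITE8", "2006-11-05", 13, "10.1.2.3")
idx.add("SITE8", "2006-11-05", 14, "10.9.9.9")
idx.add("SITE8", "2006-11-06", 2, "10.1.2.3")
print(idx.site_hours("SITE8", "2006-11", "10.1.2.3"))
```

Only the site-hours returned by the pre-query would then need their raw flow files read, which is where the reduction in search space comes from.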
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Data  Volume 


Bloomdex  -  CIAC  Analysis  Data  Flow


Analysis  Technique  Data  stores 


Analysis  tools 




Human  Inspection 


Graph  Analysis 
Visual  Analysis 


Charting 


Algorithmic 
Trending 
Historical  Queries 


Statistics  /  Aggregation 
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Reducing  the  Biflow  Search  Space 
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Bloomdex: 
Performance  Profile 
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Bloomdex:  Comparative  Performance  Profiles 


Typical  analyst  IP-based  queries: 


IPs searched | Date range searched | Site-hours searched | Site-hour hits | % Site-hour hits | Session hits | Raw biflow file hits | Search time (conventional) | Search time (bloomdex) | Relative speedup
8  | 12/13/06 - 1/9/07  | 19,140 | 466   | 2.43% | 10,594  | 600   | 16.29 hours  | 1.3 hours    | 12.5
13 | 10/15/06 - 1/17/07 | 65,888 | 1,166 | 1.77% | 158,345 | 1,667 | 57.52 hours  | 3.45 hours   | 16.7
13 | 1/22/07 - 1/29/07  | 4,959  | 31    | 0.60% | 78      | 39    | 4.16 hours   | 5.82 minutes | 42.9
4  | 1/1/07 - 1/2/07    | 725    | 3     | 0.41% | 3       | 3     | 21.5 minutes | 28 seconds   | 46.1
9  | 1/23/07 - 1/24/07  | 725    | 1     | 0.14% | 1       | 1     | 41.7 minutes | 41 seconds   | 61

•  Expect >10x speedup

•  Strong  dependency  on  site-hour  hit  ratio 

•  Future  optimizations  to  search  tools  could  make  it  even  faster 
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Bloomdex:  Performance  Profile 


Comparative  Performance: 


□  Strong  relationship  between  speedup  and  site-hour  hit  ratio 

□  Ideal  for  searches  on  sparsely-occurring  IPs 


Lawrence  Livermore  National  Laboratory 


UCRL-PRES-236738 


DOE  Computer  Incident  Advisory  Capability  (CIAC) 




Bloomdex:  Performance  Profile 


■  Bloom  filter  generation  performance: 

•  Average  site-day  filter  generation  rate: 

-  ~  33/hour  =  792/day  (current  incoming  rate:  29/day) 

•  Average  site-hour  filter  generation  rate: 

-  ~  390/hour  =  9360/day  (current  incoming  rate:  696/day) 


Will  scale  well  to  100+  sites  (cheaply) 
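
The scaling claim can be checked with a quick calculation. The sketch below is in Python and assumes one filter per site per day (or per hour) and a single generator running around the clock; the function name is illustrative and the rates are those quoted above.

```python
def headroom(filters_per_hour, sites, filters_per_site_per_day):
    """Ratio of filters that can be generated per day to filters needed."""
    generated_per_day = filters_per_hour * 24
    needed_per_day = sites * filters_per_site_per_day
    return generated_per_day / needed_per_day


print(headroom(33, 29, 1))      # site-day filters, 29 sensors
print(headroom(390, 29, 24))    # site-hour filters, 29 sensors
print(headroom(390, 100, 24))   # site-hour filters, 100 sites
```

A ratio above 1 means generation keeps up with the incoming data under these assumptions.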
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Bloomdex:  Status 


■  Coverage 

•  2.5  years  of  biflow  records  indexed. 

■  Storage  footprint 

•  3  bytes  per  unique  IP  at  the  site-hour,  site-day  and  site-month 
levels. 

•  Bloom filters currently using ~200GB of shared storage.

■  Exploring  additional  space  and  performance-based 
optimizations 

•  Other  dimensions  (e.g.,  port,  ip-port,  srcip-dstip  pairs) 

•  Counting  Bloom  filters 

•  Different  hashing  functions 

•  Parallelization 
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Bloomdex : 

Analyst  Workflow  Integration 
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10Gb  Packet  Capture 


Analyst  Workflow  Integration 


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Data  Volume 
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Facilitating  Incident  Analysis  with  Bloomdex  and 
Everest  Flow  Visualization 


■  Example  Use  Scenario: 

1.  Site reports compromise

-  Supplies 4 suspect IPs to CIAC.

2.  CIAC queries biflow data for suspect IPs using Bloomdex query tool:

-  Search all sensors over a sufficient time range (perhaps a full year).

-  Quickly identify several other sites with hosts exhibiting similar behaviors.

-  Analysis set narrowed down to just 1,635 sessions.
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Analysis  Using  Bloomdex  and  Everest  (2) 


3.  Launch Everest graph visualization tool, point to Bloomdex output file containing result set (1,635 biflow records).

4.  Issue general query to generate session graph:

[Screenshot: Everest application window with menus File, Edit, View, Insert, Query, Tools, CIAC, Connection, Workspace, Window, Help.]
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Analysis  Using  Bloomdex  and  Everest  (3) 


5.  Perform  drill-down  or  aggregate  analysis 
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Analysis  Using  Bloomdex  and  Everest  (4) 


6.  Perform  in-depth  or  summary  analysis 
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Conclusion 


The  Bloomdex  suite  enables  significantly  faster  turnaround 
times  on  analyst  IP-based  queries: 

•  It  does  this  by  drastically  narrowing  the  search  space  through 
Bloom  filter  pre-queries. 

•  Facilitates  use  of  other  analytic  tools,  such  as  Everest. 

•  Provides  significant  space  savings. 

•  Very  straightforward  and  inexpensive  to  deploy  and 
maintain. 

■  Future: 

•  Utilize  compressed  bitmap  indexes  as  an  integrated 
indexing/retrieval  solution. 
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Questions 


cdr  [at]  llnl.gov 
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Visualizations  are  Tools 
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Time  Series: 

The  Tried  and  True  Hammer 
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Time  Series 


Software Engineering Institute, Carnegie Mellon


Time  Series 
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Existence  Plots 
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Plotting  Relationships 
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Plotting  Relationships 

[Figure: two line charts, “US commercial paper and Treasury bills: 3 month rates (%)” and “Dollar libor spreads: over Fed Funds target rate (points)”, the latter with overnight and 3-month series.]

[Figure: smtp.example.com, Bytes Against Packets, 12/01/2007. Axes: # bytes in flow (to 95th percentile) and # packets in flow.]

[Figure: www.example.com, Bytes Against Packets, 12/01/2007. Axes: # bytes in flow (to 95th percentile) and # packets in flow.]

[Figure: gateway.example.com, Bytes Against Packets, 12/01/2007.]

[Figure: www.example.com, Bytes Against Packets, 12/01/2007, in six four-hour panels: 00:00:00-03:59:59, 04:00:00-07:59:59, 08:00:00-11:59:59, 12:00:00-15:59:59, 16:00:00-19:59:59, 20:00:00-23:59:59. Axis: # bytes in flow (to 95th percentile).]

Plotting  Distributions: 
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The  X  axis  is  bytes  per  packet  in  5-byte  increments.  The  Y  axis  shows 
the  quantity  of  flows  in  each  bin.  Red  indicates  flow  activity  by  known 
scanners. 
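
The binning behind such a distribution plot can be sketched in Python. The input layout (a list of byte and packet counts per flow) and the function name are assumptions for illustration; plotting and the scanner highlighting are left out.

```python
from collections import Counter


def bytes_per_packet_histogram(flows, bin_width=5):
    """Count flows per bytes-per-packet bin.

    flows is an iterable of (byte_count, packet_count) pairs; the keys of
    the result are the lower edge of each bin."""
    counts = Counter()
    for byte_count, packet_count in flows:
        if packet_count <= 0:
            continue   # skip records with no packets
        bytes_per_packet = byte_count / packet_count
        counts[int(bytes_per_packet // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical (bytes, packets) pairs.
flows = [(40, 1), (41, 1), (164, 3), (851, 10), (0, 0)]
print(bytes_per_packet_histogram(flows))
```

Running the same function separately on flows from known scanners and on all other flows would give the two series to overlay.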
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Hilbert  Curve: 
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Hilbert  Curve  (The  Movie) 
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Additional  Resources 


The R Project. Introduction to R. Chapter 13: Graphics, http://cran.r-project.org/doc/manuals/R-intro.html#Graphics


Tufte,  E.  R.  The  Visual  Display  of  Quantitative  Information.  Cheshire,  CT: 
Graphics  Press,  1983. 


Tufte,  E.  R.  Envisioning  Information.  Cheshire,  CT:  Graphics  Press,  1990. 


Tufte, E. R. Visual Explanations: Images and Quantities, Evidence and Narrative. Cheshire, CT: Graphics Press, 1997.


Tufte,  E.  R.  Beautiful  Evidence.  Cheshire,  CT:  Graphics  Press,  2006. 


Wilkinson, L., et al. The Grammar of Graphics. New York: Springer-Verlag, 1999.
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Visualization  as  an  Analysis  Tool: 

Presentation  Supplement 


This  document  is  a  supplement  to  the  presentation  “Visualization  as  an  Analysis  Tool”  given  by  Phil  Groce  and  Jeff  Janies  on 
January  9,  2008  as  part  of  FloCon  2008.  The  intent  of  the  presentation  was  to  demonstrate  how  simple  tools  used  together 
can  provide  significant  insight  into  network  behavior. 

This  supplement  includes  annotations  to  the  presentation  slides  and  a  set  of  key  points  in  the  presentation,  as  well  as 
supplementary  points  that  were  not  included  for  the  sake  of  brevity.  Except  where  noted,  all  examples  are  drawn  from  real 
network  flow  data. 

Key  Points 

Always  tell  the  truth.  Many  of  the  same  techniques  for  making  the  eye  aware  of  important  distinctions  in  the  data  can  also 
highlight  unimportant  distractions,  or  even  create  false  impressions  that  the  data  do  not  support.  Always  ensure  that  the 
visualization  presents  an  accurate  picture  of  the  data. 

Learn  how  to  use  your  tools.  It's  better  to  have  a  limited  set  of  tools  you  know  how  to  use  than  a  whole  bag  of  tools  you  don't 
understand.  If  nothing  else,  understand  the  limitations  of  the  tools  so  you  know  when  your  question  demands  a  new  one. 

Facilitate  direct  comparison.  Use  human  perception  to  your  advantage.  Put  comparable  visualizations  on  the  same  page  so 
people  can  see  them  in  peripheral  vision  and  switch  back  and  forth  quickly.  Align  visualizations  along  common  axes.  Make 
sure  similar  things  look  similar  (e.g.,  by  using  the  same  scales). 

Combine  complementary  visualization  techniques.  Use  visualizations  whose  insights  complement  each  other  in  ways  that 
facilitate  easy  comparison  between  them. 

Tables  are  visualizations,  too.  For  small  sets  of  data,  a  table  may  be  the  best  way  for  people  to  consume  the  data.  Compare  the 
following  (sample)  data  formatted  as  a  pie  chart  and  a  table: 


[Figure: pie chart of traffic by port, with a color key for HTTP, SMTP, HTTPS, DNS and Other]


Traffic by Port

Port     Traffic (% Volume)
HTTP     42
SMTP     35
HTTPS    13
DNS      6
Other    5

The  table  takes  less  space  to  communicate  the  same  information,  and  it  is  easier  to  map  the  type  of  traffic  to  the  value.  In  the 
pie  chart,  the  reader  must  consult  a  key  to  map  a  color  to  a  traffic  type,  then  find  the  color  on  the  chart.  (Putting  the  type  label 
directly  on  the  pie  slice  causes  problems  with  the  smaller  pieces;  using  callout-style  labels  makes  the  pie  chart  even  larger.) 
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Distinguish  between  useful  and  useless  precision.  The  right  numbers  in  the  right  place  are  critical  to  understanding  a 
visualization.  Examples  include  maximum  and  minimum  observed  values;  selected  local  maxima  and  minima;  and  relevant 
baselines  such  as  the  start  and  end  values  of  the  scale,  medians  or  other  useful  “average”  values,  and  specific  “important”  data 
points  (e.g.,  a  point  representing  an  important  host  or  a  flow  thought  to  be  the  source  of  a  compromise). 

In general, if the exact numbers don’t provide an important perspective on the whole visualization, they are probably a distraction.
If  all  the  numbers  are  important,  the  best  visualization  may  be  a  table. 

Consider  sampling.  For  constant-magnitude  data,  a  visualization  over  a  sample  of  the  data  may  tell  the  story  as  convincingly 
as  a  visualization  over  the  full  set  of  data.  If  the  generation  of  the  visualization  takes  significant  time,  sampling  may  improve 
performance. 
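
As a rough illustration of the sampling idea, the sketch below draws a simple random sample from a list of per-flow values before plotting. The function name `sample_flows`, the 1% fraction and the synthetic data are assumptions made for this example only; they do not come from the presentation.

```python
import random


def sample_flows(flows, fraction, seed=0):
    """Return a simple random sample (without replacement) of the flow values."""
    rng = random.Random(seed)  # fixed seed so the resulting plot is reproducible
    k = max(1, round(len(flows) * fraction))
    return rng.sample(flows, k)


# Hypothetical data: one bytes-per-packet value for each of 100,000 flows.
_gen = random.Random(1)
bytes_per_packet = [_gen.choice([40, 52, 576, 1500]) for _ in range(100_000)]

# Plot `subset` instead of the full list when rendering is slow.
subset = sample_flows(bytes_per_packet, 0.01)
```

Whether a sample tells the same story as the full data should still be checked against the full set at least once.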

Choose  your  display  media  wisely;  don't  underestimate  paper.  The  primary  display  media  available  to  most  analysts  are 
paper  and  computer  monitors.  In  our  experience,  analysts  overwhelmingly  use  monitors  over  paper.  Both  have  advantages, 
however.  Computer  screens  allow  types  of  interactivity  that  paper  cannot,  and  digital  copies  of  visualizations  can  be  easily  sent 
electronically.  Visualizations  on  paper  can  be  read  without  special  equipment,  and  annotated  with  only  a  pencil.  Moreover, 
paper resolution ranges from 300 to 1000 dpi. Screen resolution typically ranges from 72 to 100 dpi.

For  comparison,  here  is  the  same  scatterplot  rendered  at  300  dpi  and  100  dpi: 


Use  the  appropriate  number  of  dimensions.  Paper  and  screens  are  naturally  two-dimensional;  color,  perspective  and  motion 
can  provide  some  additional  dimensionality,  but  will  never  be  as  effective  as  length  and  width  on  a  flat  screen  or  piece  of  paper. 
For  example,  decorating  a  scatterplot  axis  with  a  histogram  of  the  data  along  that  axis  communicates  an  additional  dimension 
of data density often plotted to lesser effect with color or an isometric rendering of a three-dimensional plot.

Make  everything  explicit.  When  generating  visualizations  for  personal  consumption,  the  most  important  thing  is  finding  an 
insightful  view  on  the  data.  When  passing  visualizations  to  others  (including  yourself  in  the  future),  annotate  the  visualization 
with  everything  required  to  understand  where  the  data  came  from  and  what  processing  has  been  done  on  it.  (E.g.,  the 
command  used  to  extract  the  data,  how  the  data  may  have  been  trimmed  for  readability,  what  other  transformations  have  been 
done  on  the  data.)  In  particular,  if  your  scale  doesn’t  start  at  zero  for  linear  scale  plots  or  one  for  log  scale  plots,  make  a  special 
point  of  noting  it. 
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Basic  Visualization  Types 


Time  series 

Relating  data  to  time  is  so  intuitive  and  useful  that  it  risks  being  used  to  the  exclusion  of  other  visualization  tools.  It  makes 
sense,  then,  to  optimize  time  series  visualizations  as  much  as  possible. 


A  common  problem  in  time  series  data  is  relating 
multiple  independent  series  (e.g.,  volume  measurements 
of  bytes,  packets  and  flows)  by  time  to  give  a  clear  picture 
of  an  event.  Often,  the  scales  of  each  series  diverge  too 
widely  to  plot  on  the  same  scale  without  losing  detail. 
Using  different  scales  for  each  series  is  an  option  if  the 
designer  can  communicate  this  decision  clearly  to  the 
reader.  Another  (often  simpler)  solution  is  to  plot  the 
series  independently  and  align  them  on  a  shared  scale,  as 
in  the  figure  to  the  right. 


When  comparing  very  large  numbers  of  series,  the  above 
approach  breaks  down,  unless  the  data  is  very  tightly 
coupled  (e.g.,  EKG  or  seismic  data).  The  existence  plot 
trades  measurement  resolution  for  scale  by  defining  value 
ranges  and  plotting  each  series  as  a  single  line,  colored  by 
the  range  in  which  the  value  for  that  time  resides. 

To  mitigate  the  loss  of  resolution,  it  is  often  useful  to  pair 
an  existence  plot  with  a  traditional  time  series  that  relates 
to  all  the  series  in  the  existence  plot. 
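
The value-range step of an existence plot can be sketched as follows: every measurement is replaced by the index of the range it falls in, and each series is then drawn as one line colored by that index. The boundaries, host names and counts below are hypothetical, and the drawing itself is left to whatever plotting tool is in use.

```python
from bisect import bisect_right


def to_existence_rows(series_by_name, boundaries):
    """Replace every value with the index of the value range it falls in."""
    return {
        name: [bisect_right(boundaries, value) for value in values]
        for name, values in series_by_name.items()
    }


# Hypothetical hourly flow counts for three hosts.
series = {
    "host-a": [0, 3, 120, 4000],
    "host-b": [0, 0, 15, 9],
    "host-c": [500, 700, 20, 0],
}
# Range 0: 0, range 1: 1-9, range 2: 10-99, range 3: 100-999, range 4: 1000+
boundaries = [1, 10, 100, 1000]

rows = to_existence_rows(series, boundaries)
```

Each row now holds only five possible values, which is what lets many series fit on one page at the cost of measurement resolution.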
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Bytes  Packets  Flows 


Visualizing  Relationships 


Time  series  plots  are  a  specific  instance  of  relating  one  dimension  with  another  (time).  Scatterplots  are  generalizations  of  this 
approach.  By  plotting  points  in  space,  the  eye  can  perceive  relationships  between  these  dimensions  as  lines,  shapes  or  other 
patterns. 


These  three  machines  serve  three  very  different  network 
roles.  (Mail  server,  web  server  and  gateway,  respectively.) 
This  is  reflected  in  the  different  (but  consistent)  ratios 
between  bytes  per  flow  and  packets  per  flow  in  their 
traffic.  This,  in  turn,  is  visible  in  the  very  distinctive 
'shapes' their traffic takes when these dimensions are
plotted  against  each  other. 

In  this  plot,  multiple  points  that  share  exactly  the  same 
values  show  up  as  a  single  point.  The  next  section 
addresses  this  deficiency. 
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Visualizing  Distributions 


There are many ways to visually analyze the distribution of sets of single values. This document focuses on histograms, but box
plots, whisker plots, violin plots and CDF (cumulative distribution function) plots are all in common use.


The  distribution  of  bytes  per  packet  for  flows  associated 
with  a  given  host.  Most  connections  fall  within  a  few 
narrow  ranges  of  values.  As  with  the  scatterplots  above, 
the  "shape"  of  the  histogram  is  characteristic  of  the  host 
behavior. 


Because  of  the  physical  dimensions  of  a  histogram,  and 
the  fact  that  it  operates  on  a  single  (data)  dimension,  it 
complements  scatterplots  well  to  indicate  density  of  data. 

In  the  example  at  right,  the  distribution  of  values  gives  no 
indication  that  a  disproportionate  number  of  flows 
visualized  have  low  packet  and  byte  counts.  (267,400 
flows  have  6  packets;  21,700  flows  contain  only  432 
bytes.) 
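
A minimal sketch of how such a histogram could be computed from flow records, assuming each record is a (bytes, packets) pair. The 5-byte bin width follows the slide caption for the bytes-per-packet chart; the function name and the sample flows are made up for illustration.

```python
from collections import Counter


def bytes_per_packet_histogram(flows, bin_width=5):
    """Count flows per bytes-per-packet bin; keys are the lower bin edges."""
    counts = Counter()
    for nbytes, npackets in flows:
        if npackets == 0:
            continue  # skip malformed records rather than divide by zero
        bytes_per_packet = nbytes / npackets
        counts[int(bytes_per_packet // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical (bytes, packets) pairs.
flows = [(432, 6), (432, 6), (120, 3), (1500, 1), (83, 2)]
hist = bytes_per_packet_histogram(flows)
```

Drawing `hist` along one axis of a scatterplot is one way to show the density that overlapping points hide.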


[Figure: example.com Flow Volume, Binned by Bytes per Packet, 2007/12/01. X axis: bytes per packet.]


Additional  Reading 

The R Project. Introduction to R. Chapter 13: Graphics, http://cran.r-project.org/doc/manuals/R-intro.html#Graphics

Tufte,  E.  R.  The  Visual  Display  of  Quantitative  Information.  Cheshire,  CT:  Graphics  Press,  1983. 

Tufte,  E.  R.  Envisioning  Information.  Cheshire,  CT:  Graphics  Press,  1990. 

Tufte,  E.  R.  Visual  Explanations:  Images  and  Quantities,  Evidence  and  Narrative.  Cheshire,  CT:  Graphics  Press,  1997. 
Tufte,  E.  R.  Beautiful  Evidence.  Cheshire,  CT:  Graphics  Press,  2006. 

Wilkinson,  L.,  et  al.  The  Grammar  of  Graphics.  New  York:  Springer-Verlag,  1999. 
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One  Year  of  Peer  to  Peer 


Ron  McLeod,  BCSc,  MCSc. 

Director - Corporate Development, Telecom Applications Research Alliance

Doctoral Student, Faculty of Computer Science, Dalhousie University


Presentation  Summary 


This  presentation  will  profile  the  result  of  the  growth  in  peer-to-peer  applications  on  a 
sample  network  and  describe  the  resultant  massive  increase  in  the  diversity  of  traffic.  This 
diversity  impacts  the  ability  to  profile  baseline  normative  behaviour  using  Blind  Flow 
Analysis. 

I  will  also  briefly  discuss  the  application  of  SiLKtools,  Neural  Networks  and  Bioinformatic 
strategies  to  Blind  Flow  Analysis  of  real  world  security  problems  and  how  that  analysis  is 
affected  by  the  growth  in  recreational/user  driven  applications. 

What began as a basic design principle of end-to-end management with popular
applications  in  recreational  computing  is  quickly  becoming  a  dominant  evolutionary  force  in 
network  traffic  patterns. 

Traffic  patterns  are  becoming  emergent  properties  influenced  by  the  voluntary  adoption  of 
new  systems  by  individuals  without  any  collective  intent. 

The  network  is  evolving  at  the  edges. 

“Peer-to-Peer  is  the  basic  design  of  the  Internet”  -  Christian  Huitema 


Sample  Network  Description 


•  A  Multi-tenant  Commercial  Network  consisting  of: 

-  40  user  assigned  hosts,  actual  number  subject  to 
minor  fluctuations  over  time. 

-  40 special hosts not assigned to individual users.
These  hosts  form  parts  of  various  temporary 
development  and  experimental  environments. 

-  Users  were  apprised  that  Network  flow  data  was  now 
being  captured  for  experimental  and  management 
reasons. 

-  Payload  data  was  neither  collected  nor  examined. 

-  Analysts  did  not  have  access  to  the  content  of  specific 
hosts  for  further  investigation. 

-  For  confidentiality  reasons  the  identity  of  the  Network  is 
not  specified  in  this  Presentation. 


A  Review  of  Blind  Flow  Analysis 


The  Need  for  Classification  Based  on  Minimal  Information  (the 
extreme  case  in  the  world  of  tomorrow) 

Capturing  and  examining  payload  contents  is  widely  viewed  as  a  potential  violation  of 
privacy  and  placed  in  a  category  similar  to  listening  in  on  a  telephone  call. 

Even  attempts  to  use  information  derived  from  the  payload  (such  as  ngrams)  do  little  to 
alleviate  the  fundamental  concern  of  the  user  surrounding  access  to  the  payload. 

In  multi-tenant  commercial  environments  this  user  concern  may  be  based  in  protection 
of  commercial  confidentiality. 

There  is  less  (although  not  zero)  concern  among  the  user  community  with  regard  to  the 
capture  and  investigation  of  packet  header  data  (some  concern  for  Source  and 
Destination IPs and MACs).

Therefore,  the  network  analyst  may  be  limited  to  examining  a  severely  reduced  subset  of 
the  packet  header  information  in  an  attempt  to  determine  if  the  system  under  their 
management  (or  monitoring)  is  operating  properly  or  experiencing  anomalous  behavior. 

The  loss  of  access  to  the  originating  address  information  means  that  the  analyst  no 
longer  has  access  to  a  unique  field  in  the  data  that  identifies  the  individual  hosts  in  the 
traffic  (i.e.  they  cannot  tell  one  computer  from  another  by  looking  at  the  remaining  flow 
record  traffic  alone). 

In  such  an  environment,  what  is  required  is  a  method  of  classification  that  relies  on 
minimal  information  and  the  development  of  traffic  flow  behaviour  models  that  use  only 
this  information. 


One Strategy for Comparing A Suspicious Host to
a Standard Workstation Using Blind Flow Analysis


Metric                           Local Baseline Workstation Behaviour (BWB)   Suspicious Host
Bytes transferred in one month   < 20 million per month                       45 billion per month
Internal DIPs                    < 10 per month                               3 per month
External DIPs                    < 20 per month                               1.74 million per month
Protocol 1 (ICMP)                < 2%                                         1%
Protocol 6 (TCP)                 > 70%                                        9%
Protocol 17 (UDP)                < 30%                                        90%
Number of protocols              < 5                                          3

                    Local Baseline Workstation Behaviour (BWB)   Suspicious Host
Port Number Range   # of Ports   % of Ports   % of Total         # of Ports   % of Ports   % of Total
                    Accessed     Accessed     Bytes Traffic      Accessed     Accessed     Bytes Traffic
< 1024              < 7          20-50%       < 1%               45           0.07%        (not legible)
1024-5000           < 10         > 30%        > 90%              3,976        6%           1%
> 5000              < 5          < 20%        < 9%               60,059       93%          99%
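
The comparison on this slide can be sketched as a set of threshold checks. The thresholds and the suspicious host's values below are the ones shown on the slide; the dictionary layout, the key names and the `deviations` helper are assumptions for this example.

```python
# Baseline Workstation Behaviour (BWB) limits as listed on the slide.
BASELINE = {
    "bytes": lambda v: v < 20_000_000,
    "internal_dips": lambda v: v < 10,
    "external_dips": lambda v: v < 20,
    "icmp_pct": lambda v: v < 2,    # protocol 1
    "tcp_pct": lambda v: v > 70,    # protocol 6
    "udp_pct": lambda v: v < 30,    # protocol 17
    "num_protocols": lambda v: v < 5,
}


def deviations(host):
    """Return the names of the metrics that fall outside the baseline."""
    return [name for name, within in BASELINE.items() if not within(host[name])]


# One month of the suspicious host, as shown on the slide.
suspicious_host = {
    "bytes": 45_000_000_000,
    "internal_dips": 3,
    "external_dips": 1_740_000,
    "icmp_pct": 1,
    "tcp_pct": 9,
    "udp_pct": 90,
    "num_protocols": 3,
}

flagged = deviations(suspicious_host)
```

None of these checks needs payload data or host addresses, which is the point of Blind Flow Analysis.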

Impact  of  Peering  Traffic  on  Blind  Flow  Analysis 
and  the  Uniqueness  of  Minimal  Information 


•  In early 2006 a Neural Network was used to classify
workstation traffic based on a localized “Workstation
Genome”.

•  It was found that workstation behaviour could be fully
described by a set of 23 unique 3-tuples formed by the
combination of Protocol, Destination Port, and Byte
Range ID, where Byte Range ID was one of five levels
given by:


Range   Bytes
1       0-100
2       100-999
3       1000-9,999
4       10,000-49,999
5       50,000 +
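
A minimal sketch of how a flow record could be turned into one such 3-tuple. The slide's first two ranges both include 100; this example assigns exactly 100 bytes to level 1, which is an assumption, as are the function names.

```python
# Inclusive upper bounds of Byte Range IDs 1 to 4; anything larger is level 5.
BYTE_RANGE_UPPER = [100, 999, 9_999, 49_999]


def byte_range_id(nbytes):
    """Map a flow's byte count to one of the five Byte Range ID levels."""
    for level, upper in enumerate(BYTE_RANGE_UPPER, start=1):
        if nbytes <= upper:
            return level
    return 5


def gene(protocol, dest_port, nbytes):
    """Build the (Protocol, Destination Port, Byte Range ID) 3-tuple."""
    return (protocol, dest_port, byte_range_id(nbytes))


# A TCP flow to port 80 carrying 1,500 bytes.
example = gene(6, 80, 1500)
```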




[Figure: neural network with 50 hidden nodes]


Each  input  frequency  vector  contains  an  observed  frequency  for  each  3-tuple 
for  a  24  hour  period. 

Each  3-tuple  is  defined  as  Protocol,  Destination  Port,  Byte  Range. 

All  observed  Workstations  could  be  described  by  a  23  element  Vector. 
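
The input vector described above could be assembled along these lines. The three-element genome and the day's observations are invented for illustration (the slide's genome has 23 tuples), and raw counts are used as the "observed frequency", which is an assumption.

```python
from collections import Counter


def frequency_vector(observed, genome):
    """Count how often each genome 3-tuple occurred, in genome order."""
    counts = Counter(observed)
    return [counts[g] for g in genome]


# Hypothetical genome of (Protocol, Destination Port, Byte Range) tuples.
genome = [(6, 80, 2), (6, 80, 3), (17, 53, 1)]

# Hypothetical 3-tuples observed for one workstation over 24 hours.
day = [(6, 80, 3), (17, 53, 1), (17, 53, 1), (6, 443, 4)]

vector = frequency_vector(day, genome)
```

Tuples outside the genome, such as (6, 443, 4) here, are simply not represented in the vector; that is harmless only while the genome really does describe all observed workstation behaviour.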




Host ID     Day   Output Vector        Classification (Hit/Miss/Unknown)
1 [0 1 0]   1     [0.04 0.86 0.08]     HIT
            2     [0.17 0.97 0.00]     HIT
            3     [0.10 0.91 0.02]     HIT
            4     [0.09 0.95 0.01]     HIT
2 [1 0 0]   1     [0.95 0.06 0.00]     HIT
            2     [0.96 0.04 0.00]     HIT
            3     [0.95 0.06 0.00]     HIT
            4     [0.95 0.07 0.00]     HIT
3 [0 0 1]   1     [0.00 0.09 0.92]     HIT
            2     [0.00 0.00 0.99]     HIT
            3     [0.00 0.12 0.92]     HIT
            4     [0.00 0.00 0.99]     HIT

100%  Success  rate  on  uniquely  classifying  a  small  sample  of  the  population 
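
The slide does not state the rule used to turn an output vector into Hit, Miss or Unknown. One plausible rule, shown here purely as an assumption, takes the largest output, calls the result Unknown if it is below a threshold, and otherwise compares its position with the host's target vector. The 0.5 threshold is hypothetical.

```python
def classify(output, target, threshold=0.5):
    """Label a network output vector against a one-hot target vector."""
    best = max(range(len(output)), key=output.__getitem__)
    if output[best] < threshold:
        return "UNKNOWN"  # no output is strong enough to decide
    return "HIT" if target[best] == 1 else "MISS"


# Host 1, day 1, with values taken from the table above.
label = classify([0.04, 0.86, 0.08], [0, 1, 0])
```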




•  In  early  2007  a  similar  population  of  workstations  was  chosen  with 
the  goal  of  testing  a  Support  Vector  Machine  approach  to 
classification. 

•  To  the  great  surprise  of  the  author,  the  number  of  unique 
3-tuples  required  to  uniquely  describe  the  Workstation 
Genome  had  risen  from  23  to  over  600  in  16  months. 


•  Subsequent investigation showed that the diversity of the observed
behaviour increased as a function of both population size and
the length of the sampling period.



[Figure: Percentage of Unique Genes as a function of the number of Flow Records]
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Number  of  Flow  Records 


By  limiting  the  traffic  to  ICMP  and  TCP  flow  records,  the  number  of  unique  tuples  required  to 
adequately  describe  the  population  reached  a  steady  state  of  approximately  18%  of  the  total 
number  of  all  expressed  tuples. 

When  UDP  traffic  was  introduced  into  the  sample,  the  percentage  of  unique  tuples  in  the 
population  did  not  reach  a  steady  state  in  proportionality  but  rather  the  number  of  the  unique 
tuples  increased  in  linear  proportion  to  the  number  of  total  tuples  observed. 
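
A curve of this kind could be computed as sketched below, reading "percentage of unique tuples" as distinct tuples divided by the number of tuples observed so far. That reading, the function name and the sample records are assumptions.

```python
def unique_ratio_curve(tuples, step):
    """Return (records seen, % distinct tuples) after every `step` records."""
    seen = set()
    curve = []
    for count, item in enumerate(tuples, start=1):
        seen.add(item)
        if count % step == 0:
            curve.append((count, 100.0 * len(seen) / count))
    return curve


# Hypothetical (Protocol, Destination Port, Byte Range) tuples in arrival order.
records = [
    (6, 80, 2), (6, 80, 2), (17, 5000, 1),
    (6, 80, 3), (17, 5001, 1), (6, 80, 2),
]
curve = unique_ratio_curve(records, step=2)
```

If new tuples keep appearing in proportion to the number of records, as reported for the UDP traffic, the percentage never settles and a fixed genome stops describing the population.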




•  What  happened  to  the  network  traffic  to  create  such  diversity  in  such 
a  short  period  of  time? 

•  It  is  not  yet  possible  to  accurately  comment  on  the  nature  of  the 
change  in  traffic  volume. 

•  Two  fundamental  behaviours  changed. 

-  Protocol  Ratio 

•  From  TCP  70%  UDP  30% 

•  To  TCP  50%  UDP  50% 

-  Use  of  Unique  Destination  Ports  by  Workstations  now  parallels  Server 
behaviour. 
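
Both behaviours can be measured directly from flow records. The sketch below assumes each record is a (protocol, destination port) pair and computes the ratio over flow counts; the slide does not say whether its ratio is by flows or by bytes, so that choice is an assumption.

```python
from collections import Counter


def protocol_ratio(flows):
    """Return the percentage of flows per IP protocol number."""
    counts = Counter(protocol for protocol, _ in flows)
    total = sum(counts.values())
    return {protocol: 100.0 * n / total for protocol, n in counts.items()}


def unique_destination_ports(flows):
    """Count the distinct destination ports contacted."""
    return len({dest_port for _, dest_port in flows})


# Hypothetical flows: two TCP (6) and two UDP (17).
flows = [(6, 80), (6, 443), (17, 53), (17, 40000)]
ratio = protocol_ratio(flows)
ports = unique_destination_ports(flows)
```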


One  Year  of  Peer-to-Peer 


Much  has  been  written  lately  of  the  growth  and  deployment  of  Peer-to-Peer 
Protocols 

Recommended reading: “Transport Layer Identification of P2P Traffic”,
Thomas Karagiannis, et al., IMC ’04, 2004, Taormina, Italy.

Perhaps  Peer-to-Peer  is  the  culprit. 

Decided  to  check  for  the  presence  of  known  P2P  in  the  traffic 

eDonkey2000 

Fasttrack 

BitTorrent

Gnutella 

MP2P 


[Figure: Protocol Flows By Month (nw). X axis: Month]


The  graph  above  shows  the  pattern  of  flows  by  protocol  for  one  year  for  the 
Target  network. 


[Figure: TCP Bytes Per Month (nw). X axis: Month]


[Figure: Destination IPs (DIPs) per Month. X axis: Months]


For  a  small  network  they  talked  to  quite  a  few  friends. 


[Figure: SIPs per month. X axis: Months]


The  feeling  was  mutual. 




Let’s  consider  the  traffic  contribution  for  each  P2P  Application  in  the  table. 




MP2P, or Manolito, is a P2P system primarily
used to share music files. MP2P traffic was the
smallest contributor to the overall network traffic
among the observed systems. This traffic
reached  a  peak  flow  count  of  just  under  160  in 
January  2007. 




The  Fasttrack  P2P  system  is  primarily  used  by 
Kazaa  and  its  variants  to  exchange  mp3  music 
files.  Fasttrack  traffic  reached  a  peak  flow 
count  of  2,500  in  July  2006. 




EDonkey2000  was  a  peer-to-peer 
system  primarily  used  to  distribute 
large  images,  video  games  and 
software.  Although  officially 
discontinued  in  September  2005  due 
to  legal  action  brought  by  the 
Recording  Industry  Association  of 
America  (RIAA),  we  speculate,  based 
on  our  profiling,  that  we  observed 
eDonkey2000  communication  during 
2006.  EDonkey  traffic  passed  25,000 
flows  in  July  2006. 


[Figure: Gnutella Flows]


Gnutella  is  a  multi-tier  Peer  based  file 
exchange  system.  Traffic  from 
Gnutella  ranged  from  5,000  to  35,000 
flows  per  month. 
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One  Year  of  Peer-to-Peer 
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BitTorrent is an increasingly popular P2P system used for exchanging large
data files. Many open source software releases are distributed using BitTorrent.
It is also used to distribute legal movie and music downloads. BitTorrent traffic
eclipsed most P2P traffic at 300,000 flows.


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One  Year  of  Peer-to-Peer 


Unfortunately the overall Peer-to-Peer flow pattern did not match
the pattern that we were seeking, namely a 50/50 ratio of TCP
to UDP.


[Figure: Protocol Flows By Month (nw). X axis: Month]


The  graph  above  shows  the  pattern  for  which  we  were  searching.  This 
is  the  traffic  from  a  single  user  workstation,  with  a  peak  flow  count  of 
50,000  flows  per  month. 




This workstation changed its behaviour in late fall 2006 from talking to
fewer than 100 DIPs per month to 6,000 DIPs per month.


[Figure: Flows by Protocol, with TCP and UDP series. X axis: Months]


Who am I?




SKYPE 


This traffic pattern is driven by the adoption of VoIP by a single user in the
target  network. 


Disclaimer: It is important to point out that, since the experimenter had no
access to the actual machine or payload data, this conclusion is simply
conjecture based on known user behaviour within the target network.
(Skype  is  a  wonderful  App) 


Observations  on  Traffic  for 
Clients  and  Peers 


•  Consumes  considerable  Resources. 

•  Represents  an  Application  Level  WAN  Network 
for  Communication. 

•  Provides  a  channel  to  hide  Malicious  Activity. 


“McAfee suggested hackers were likely to create malicious software to target
instant messaging services, Voice over Internet Protocol (VoIP) telephony
services and online gaming sites.” Hackers will target social networking sites:
security firms - Thursday, November 29, 2007, CBC News http://www.cbc.ca


Evidence that all is not as it Appears


•  One  day  in  February  a  conversation  took 
place  between  a  user  host  on  the  Network 
and  a  host  compromised  by  an  on-line 
game  server. 

•  Two  hours  later  the  user  host  was 
attempting  to  contact  a  few  friends.... 
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We  Need  to  Re-Consider  our 
Willingness  to  be  a  Peer 


•  Users  willingly  download  and  install 
client/peer/server  software. 

•  They  even  participate  in  strategies  to  avoid 
barriers and impediments (like NAT'ing).

•  There  is  an  implied  trust  that  the  communication 
is  exclusively  what  it  claims  to  be. 

•  “When  they  thought  they  were  playing  at  war 
craft,  they  were  actually  playing  at  war  craft.” 


Concluding  Notes 


•  The  network  is  evolving  at  the  edges 

•  This  means  that  network  architectures, 
management  and  provisioning  strategies 
are now more responsive than ever.

•  Global  communication  resources  are 
primarily  influenced  by  the  uncoordinated 
activities  of  individuals. 

•  Traffic  patterns  are  emergent  properties 
without  intent. 


Future  Work 


•  Study  the  growth  in  diversity  of  patterns  in  traffic. 

•  Study  the  form  and  distribution  of  applications  and  participants. 

•  Track  Unidentified  Anomalies. 

•  In February 2008, TARA will announce the InTARA project: Intelligent Network Traffic Analyzers for Reconstructive and Real Time Analysis.

•  InTARA  will  be  a  multi-million  dollar,  multi-year  project  to  develop 
intelligent  traffic  analysis  capabilities  for  the  good  guys. 

•  We  are  seeking  global  collaborative  research  and  commercialization 
partners.  Early  stage  interest  from  Australia,  India,  Switzerland, 
Canada. 


Identifying  Anomalous  Traffic 
Using  Delta  Traffic 

Tsuyoshi KONDOH and Keisuke ISHIBASHI

Information  Sharing  Platform  Labs. 

NTT 


FloCon 2008, January 7-10, 2008, Savannah GA


Outline 

•  Background  and  Motivation 

-  Identifying  anomalous  traffic  is  the  missing  piece. 

•  Our  Technique:  DELTAA 

-  Concepts 

1. Extract anomalous traffic as the delta of normal and anomalous
time periods.

2.  Auto-aggregate  extracted  anomalous  traffic. 

-  Operation  of  our  technique 

•  How  to  implement  the  above  concepts. 

•  Evaluation 

-  Evaluation  using  synthesized  DDoS  traffic. 

•  Summary 
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Background and Motivation

•  Monitoring of traffic volumes is widely
used for network operation (e.g. MRTG).

•  Many techniques for detecting
anomalous volume change have been
proposed (NBAD, Holt-Winters in MRTG,
etc.).

•  Some tools exist to mitigate damage from
anomalous traffic (e.g. drop/rate limit at
router, detour to Cisco Guard, etc.).

•  However, accurate mitigation needs
accurate ACL sets.

•  Generating accurate ACL sets requires
manual drill down by operator.

-  Too costly.

[Figure: time series of total traffic by bps, where the anomaly is auto-detected, followed by manual drill down of the anomalous traffic into a time series of protocol composition and a time series of dst port composition]

Our Technique: DELTAA

•  DELTAA outputs ACL sets for filtering or rate limiting to
mitigate the damage from anomalous traffic.

-  DELTAA: Delta Traffic Automatic Aggregator

•  Three concepts of DELTAA (today, I will focus on two concepts):

1. Reveal anomalous traffic using delta traffic.

2. Aggregate delta traffic and generate optimized ACL sets on a
single dimension (e.g. source IP address dimension).

3. Generate multi-dimensional ACL sets by integrating each
dimensional anomalous traffic range.
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Concept  #1 : 

(1)  Definition  of  “Normal”  and  “Anomalous”  Traffic 

Throughout  this  presentation,  I  use  the  following  definitions. 

•  Anomalous  traffic:  Traffic  that  causes  a  change  in  traffic 
volume  (bps/pps/fps). 

-  BitTorrent  and  server  intrusion  are  out  of  scope  because  they 
always  exist  or  do  not  cause  a  volume  change. 

•  Normal  period:  Period  when  traffic  volume  is  normal. 

•  Anomalous  period:  Period  when  traffic  volume  is  anomalous. 


[Figure: traffic volume (bps) over time, with the normal traffic level and the anomalous traffic above it marked. Callouts: "Not stuck for the meantime. It looks like a signature of normal traffic." and "Stuck!"]


Concept  #1  : 

(2)  Reveal  Anomalous  Traffic 


•  Make two assumptions

1. traffic of normal period = normal traffic

2. traffic of anomalous period = normal traffic + anomalous traffic


We can then extract anomalous traffic as the delta of the
above two periods.

anomalous traffic = traffic of anomalous period - traffic of normal period

[Figure: traffic volume (bps) over time, showing a normal period, an anomalous period and the anomalous traffic that remains after subtracting the normal period]

Extracting anomalous traffic
from “traffic of anomalous period”
is difficult because it is a mixture
of normal and anomalous traffic.

Taking the delta between
“traffic of normal period”
and that of anomalous
period, we can effectively
extract anomalous traffic.

Concept #2:

Auto-aggregate Delta Traffic


•  In aggregation, optimize a trade-off (false negative, false
positive, number of ACLs) by using the best range-selection
algorithm.


Aggregation example: Aggregate from distinct source IP
addresses to address range sets.


Example 1: ACL sets for covering all anomalous traffic.
ACL(1) will filter out normal traffic, as a false positive.

Example 2 (better range selection): Splitting ACL range
to avoid collateral damage.

[Figure: normal and anomalous traffic plotted along the src_ip axis starting at 0.0.0.0, with the address ranges covered by ACL(1) and ACL(2) in each example]

Explanation  of  Our  Technique 

•  Our  technique  can  generate  multi-dimensional  ACL  sets. 

-  e.g.  source/destination  IP  address,  source/destination  port, 
protocol,  flow  exporter,  and  router  interface 

-  Multi-dimensional does not mean handling each of the above
kinds of information independently.

-  Our technique merges the above information to make
multi-dimensional ACL sets.

•  In this presentation, I focus on source IP dimension
identification as an example and explain step by step.
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Step 1: (1) Counting Up

Count the traffic of the normal and anomalous periods for each source IP address.

[Figure: traffic volume per source IP address (src_ip from 0.0.0.0 to 255.255.255.255) for the normal period (600 Mbps). Callout: count traffic volume for each source IP address.]


Step  1:  (2)  Making  Delta  Traffic 

Make  delta  traffic  by  subtracting  traffic  of  normal  period  from  that  of 
anomalous  period. 


[Figure: traffic volume (bps) per source IP address (src_ip from 0.0.0.0 to 255.255.255.255) for the normal period (600 Mbps) and the anomalous period (1 Gbps). Subtracting for each source IP address leaves the anomalous traffic (400 Mbps).]

DELTAA obtains
anomalous traffic with
granularity of source
address as delta traffic.
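
Step 1 can be sketched as a per-source subtraction of two traffic tables. The totals match the slide (600 Mbps normal, 1 Gbps anomalous, 400 Mbps delta), but the addresses and the per-source split are invented, and clamping negative differences to zero is an assumption of this example rather than something the slides state.

```python
def delta_traffic(normal, anomalous):
    """Subtract the normal period from the anomalous period per source IP.

    Sources whose traffic did not grow are dropped (assumed behaviour).
    """
    delta = {}
    for src in set(normal) | set(anomalous):
        diff = anomalous.get(src, 0) - normal.get(src, 0)
        if diff > 0:
            delta[src] = diff
    return delta


# Hypothetical per-source volumes in Mbps.
normal_period = {"10.0.0.1": 350, "172.16.0.9": 250}
anomalous_period = {"10.0.0.1": 350, "172.16.0.9": 250, "198.51.100.7": 400}

delta = delta_traffic(normal_period, anomalous_period)
```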
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Step  2:  (1)  Building  Tree  of  Normal  and  Anomalous 
Traffic 


•  Example:  When  we  use  only  anomalous  traffic  information, 
collateral  damage  cannot  be  avoided. 

-  Causes  mis-filtering  of  normal  traffic. 


So, build a traffic tree using both normal and anomalous
traffic.

[Figure: ACL(1) and ACL(2) chosen from anomalous traffic alone ("don't care" about normal traffic), causing collateral damage to normal traffic]
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Step  2:  (2)  Building  Tree  of  Normal  and  Anomalous 
Traffic 

•  Traffic  tree  making 

-  Build  up  from  individual  source  IP  addresses  (depth=32). 

-  Each  node  has  information  about  coverage  and  collateral  ratio. 

•  Collateral ratio: normal traffic of the node ÷ total normal traffic

•  Coverage ratio: anomalous traffic of the node ÷ total anomalous traffic

-  Make  parent  nodes  by  merging  child  node  information. 
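
A minimal sketch of such a tree, assuming per-source volumes for normal and anomalous traffic are already available. Instead of merging children pairwise, this version adds each address's volume to every prefix that contains it, which yields the same sums. The function name and the sample traffic are hypothetical.

```python
import ipaddress
from collections import defaultdict


def build_traffic_tree(normal, anomalous):
    """Return {IPv4Network: (coverage %, collateral %)} for every prefix
    that contains at least one observed source address."""
    sums = defaultdict(lambda: [0.0, 0.0])  # [coverage, collateral]

    def add(volumes, slot):
        total = sum(volumes.values())
        for ip, volume in volumes.items():
            addr = int(ipaddress.IPv4Address(ip))
            for length in range(33):  # depth 0 (root) to depth 32 (leaf)
                shift = 32 - length
                prefix = (addr >> shift) << shift
                sums[(prefix, length)][slot] += 100.0 * volume / total

    add(anomalous, 0)  # coverage: share of anomalous traffic under the node
    add(normal, 1)     # collateral: share of normal traffic under the node
    return {ipaddress.IPv4Network(key): tuple(v) for key, v in sums.items()}


# Hypothetical per-source volumes.
anomalous = {"192.0.2.1": 50, "10.0.0.1": 50}
normal = {"10.0.0.2": 30, "200.0.0.1": 70}
tree = build_traffic_tree(normal, anomalous)
```

Looking up `tree[ipaddress.IPv4Network("0.0.0.0/1")]` gives the coverage and collateral of blocking the lower half of the address space.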


[Figure: traffic tree built from normal and anomalous traffic, from depth=32 (divided into distinct src_ip) through depth=4, depth=3, depth=2 and depth=1]
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128.0.0.0/1 
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Collateral  ratio 
(Sum is 100%)

Coverage ratio
(Sum is 100%)

Node labels: prefix/length, coverage/collateral

depth=0 (not divided): 0.0.0.0/0, 100/100
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Step  3:  Selecting  Best  Node  Sets  (ACL  sets) 


1. To reduce search space, delete unnecessary nodes.

-  Unnecessary node: a node having a low coverage ratio (little anomalous traffic)
or little difference from its descendant nodes.

2. Search for the best node sets by evaluating the goodness of every node combination.

-  Best node combination = Best ACL sets for source IP dimension
•  But how do we decide the goodness of the node sets?


[Figure: traffic tree over source IP, from depth=32 (divided into distinct src_ip) through depth=4 (divided 16), depth=3 (divided 8), depth=2 (divided 4), and depth=1 (divided 2) to depth=0 (not divided). Nodes are labeled prefix/length with coverage/collateral.]

depth=0:  0.0.0.0/0  100/100
depth=1:  0.0.0.0/1  50/30
depth=2:  0.0.0.0/2  25/25;  64.0.0.0/2  25/5
depth=3:  0.0.0.0/3  12.5/5;  32.0.0.0/3  12.5/20;  64.0.0.0/3  12.5/0;  96.0.0.0/3  12.5/5;  192.0.0.0/3  50/5

Selected nodes:
ACL(1) = 0.0.0.0/3    cov=12.5, col=5
ACL(2) = 64.0.0.0/2   cov=25, col=5
ACL(3) = 192.0.0.0/3  cov=50, col=5

Delete unnecessary nodes having a low coverage ratio.
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Criteria  of  “Goodness” 


•  Three  criteria  of  identification 

1.  Coverage ratio:

Maximize  filtered  anomalous  traffic  =  (1  -  FNR) 

2.  Collateral  (damage)  ratio: 

Minimize  filtered  (normal)  legitimate  traffic  =  (FPR) 

3.  Number  of  ACLs: 

ACL  entry  budget  is  limited,  so  having  few  ACLs  is  better. 

•  But,  these  three  criteria  have  a  trade-off  relationship  with 
each  other. 


Dummy graph: Time series of traffic with output flow_IDs displayed in separate colors


Evaluation  Formula  for  Goodness 


To  evaluate  goodness  of  best  ACL  sets,  we  use  the  formula: 

coverage: cov, collateral ratio: coll, no. of ACLs: n

rate = \frac{(\beta - \alpha) + \alpha \cdot cov - \beta \cdot coll}{n^{\gamma}}

(\alpha, \beta, \gamma: weighting coefficients)


-  Weighting  coefficients  can  be  tuned  to  reflect  network  policy  or 
customer  requirements. 
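A small sketch of how such a rate could be evaluated is given below. The formula on the slide is partly illegible in this copy, so the code assumes the form rate = ((β − α) + α·cov − β·coll) / n^γ; that form, the function name `goodness_rate`, the use of fractions in [0, 1] for cov and coll, and the default coefficient values are all assumptions for illustration, not the presenters' settings.

```python
def goodness_rate(cov, coll, n, alpha=1.0, beta=1.0, gamma=0.5):
    """Score an ACL set: reward coverage, penalize collateral and ACL count.

    cov and coll are fractions in [0, 1]; n is the number of ACLs.
    alpha, beta, gamma are hypothetical weighting coefficients.
    """
    return ((beta - alpha) + alpha * cov - beta * coll) / (n ** gamma)


# Same coverage and collateral, but more ACL entries: the score drops.
one_acl = goodness_rate(cov=1.0, coll=0.0, n=1)
four_acls = goodness_rate(cov=1.0, coll=0.0, n=4)
print(one_acl, four_acls)
```

Raising gamma makes the score more sensitive to the ACL entry budget, while the ratio of alpha to beta shifts the balance between catching attack traffic and sparing legitimate traffic.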

Example ACL splitting: ACL sets for covering all anomalous traffic

rate = 2.61

rate = 3.18 (coverage=95%, collateral=0%, no. of ACLs=3)
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Evaluation  and  Results:  Test  Data  Set 


•  Normal  traffic:  publicly  available  traffic  data  captured  on 
transpacific  line  (100  Mbps) 

•  Anomalous  traffic:  injected  synthesized  DDoS  attack  traffic 

-  Mimic  large  DDoS  attack 

•  We  choose  source/destination  addresses  that  have  large  normal 
traffic  because  simple  identification  would  cause  collateral. 

-  Destination:  Popular  server  appeared  in  normal  traffic 

-  Source: Choose IP address blocks (/16) from which the volume of normal traffic to the destination is largest.

-  Port  numbers  and  protocol  of  attack  traffic  are  the  same  as 
those  of  normal  traffic. 


[Figure legend: ■ synthesized attack traffic, □ normal (legitimate) traffic]


Evaluation  and  Results:  Results  (1) 

•  Results: We obtained four ACL sets with the following results:

-  coverage:  93.75% 

-  collateral:  0.00% 

-  no.  of  ACL  sets:  4 
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Evaluation  and  Results:  Results  (2)  OUTPUT 


Basic information:

basetime_len= 60.0 (sec): (1168362060.0 - 1168362120.0)
anomtime_len= 60.0 (sec): (1168362180.0 - 1168362240.0)
base_total_bps= 89,121,539.5
anom_total_bps= 137,729,812.7
diff_total_bps= 48,608,273.2 (+54.5 %)

Single-dimension identification results:

1-D OUTPUT: PROTOCOL= 6
coverage= 100.42 collateral 95.52

1-D OUTPUT: SRC PORT= high
coverage= 108.27 collateral 33.42

1-D OUTPUT: DST PORT= high
coverage= 100.09 collateral 96.40

1-D OUTPUT: SRC_IP
coverage= 96.43 collateral 0.00
119.170.0.0/17    coverage= 51.43 collateral 0.00
119.170.128.0/18  coverage= 25.72 collateral 0.00
119.170.192.0/19  coverage= 12.86 collateral 0.00
119.170.240.0/20  coverage= 6.43 collateral 0.00

1-D OUTPUT: DST_IP
coverage= 102.93 collateral 2.17
134.45.182.70/32  coverage= 102.93 collateral 2.17

MULTI-DIMENSION_FLOW_OUTPUT coverage= 96.43 collateral 0.00

flowID_0: cov= 51.43 col= 0.00: 119.170.0.0/17 134.45.182.70/32 6 high high
flowID_1: cov= 25.72 col= 0.00: 119.170.128.0/18 134.45.182.70/32 6 high high
flowID_2: cov= 12.86 col= 0.00: 119.170.192.0/19 134.45.182.70/32 6 high high
flowID_3: cov= 6.43 col= 0.00: 119.170.240.0/20 134.45.182.70/32 6 high high

Fields of each flowID line: coverage, collateral, src_ip, dst_ip, protocol, src_port, dst_port


Evaluation  and  Results  (3):  Destination  IP  Tree 


1-D OUTPUT: DST_IP
coverage= 102.93 collateral= 2.17
134.45.182.70/32  coverage= 102.93 collateral 2.17

[Figure: destination IP tree; each node is labeled prefix/length with coverage / collateral.]

0.0.0.0/0:  100.00 / 100.00
128.0.0.0/1:  98.22 / 59.68
48.0.0.0/4:  5.65 / 5.67
128.0.0 (label truncated in the source):  103.49
134.45.182.70/32:  102.93 / 2.17

134.45.182.70/32: Chosen for output because its coverage does not differ significantly from that of the nodes above it, and its collateral ratio is smaller.

48.0.0.0/4: Not chosen because its coverage is slightly less than its collateral ratio. If this node were chosen, the evaluation rate would fall.
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Evaluation  and  Results  (4):  Source  IP  Tree 


1-D OUTPUT: SRC_IP
coverage= 96.43 collateral= 0.00
(1) 119.170.0.0/17    coverage= 51.43 collateral 0.00
(2) 119.170.128.0/18  coverage= 25.72 collateral 0.00
(3) 119.170.192.0/19  coverage= 12.86 collateral 0.00
(4) 119.170.240.0/20  coverage= 6.43 collateral 0.00

[Figure: source IP tree; each node is labeled prefix/length with coverage / collateral.]

0.0.0.0/0 (top of tree: root node):  100.00 / 100.00
0.0.0.0/1:  106.49 / 14.26
119.170.0.0/16 (whole range of synthesized anomalous traffic):  102.69 / 0.23
/19 nodes: 119.170.64.0/19, 119.170.96.0/19, 119.170.128.0/19, 119.170.160.0/19, 119.170.192.0/19:  12.86 / 0.00 each
119.170.224.0/19:  12.68 / 0.23
/20 nodes: 119.170.0.0/20, 119.170.16.0/20, 119.170.32.0/20, 119.170.48.0/20, 119.170.64.0/20, 119.170.80.0/20, 119.170.96.0/20, 119.170.112.0/20, 119.170.128.0/20, 119.170.144.0/20, 119.170.160.0/20, 119.170.176.0/20, 119.170.192.0/20, 119.170.208.0/20:  6.43 / 0.00 each

There are no nodes more specific than /20 (that is, /21 or longer) because their coverage is less than 5%.

This range (/20) includes all normal traffic. If you choose this range, collateral damage will occur.
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Summary 


•  Identified three criteria for optimal ACL sets

-  for mitigating DDoS attacks on routers

•  Proposed the DELTAA technique: it optimizes the trade-off among these criteria, using both normal and anomalous traffic.

•  Showed  effectiveness  of  DELTAA. 


-  Evaluation  results  using  prototype  and  synthesized  data  sets: 


•  coverage: 93.75%

•  collateral: 0.00%

•  no. of ACL sets: 4

[Figure: stacked time series of traffic, vertical axis 0.0 to 180.0, horizontal axis Time. Legend: false negative traffic, ACL (4) cover, ACL (3) cover, ACL (2) cover, ACL (1) cover, total collateral, normal (legitimate) traffic.]


Thank  you. 

Any  questions  are  welcome. 
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■  Introduction 

□  Why  do  we  need  a  large-scale  collection  system? 

□  What  is  Flow  Mediator? 

■  Requirements 

□  I  tried  to  explore  the  possibility  of  a  large-scale 
collection  system  for  large  networks. 

■  Heuristic  method  of  designing  traffic  collection 
system 

□  Estimate  number  of  flow  records  after 
aggregation  or  sampling 

□Adjust  several  parameters  based  on  this  result 

■  Summary 
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Introduction 


■  Traffic volumes in ISP networks have grown enormously over the last few years.

□  The number of exported flow records is becoming so large that a single collector cannot handle them.

■  A lower sampling rate makes small flows invisible.

□  Even as traffic grows, network operators would like to keep the same sampling rate as far as possible.

■  Aggregated flow records from a router hide port numbers and IP addresses.

□  Exporting 5-tuple flow records from the router is better.


The  demand  for  a  large-scale  traffic-collection  system  is 
growing. 
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What  is  Flow  Mediator? 


■  A Flow Mediator† is a device that "mediates" flow records and has the following functions:

□  collects flow records from various exporters

□  stores original flow records

□  aggregates flow records

[Figure labels: Network designer, Customer Service, Network Operator]

†  draft-kobayashi-ipfix-mediator-model-01.txt
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You  can  easily  make  Flow  Mediation  code 

■  The Net::Flow Perl module is available on CPAN.

□  http://search.cpan.org/~akoba/Net-Flow-0.02/

□  The module can encode and decode NetFlow/IPFIX packets.

□  The encoding and decoding functions have similar interfaces.
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Requirements 

■  Make  traffic-collection  system  to  meet 
following  requirements 

□  Requirement  1:  measure  traffic  flow  of  entire 
networks 

■  measure  traffic  matrices  PoP  by  PoP  and  router  by 
router 

□  Requirement  2:  store  received  5-tuple  flow 
records  from  router 

■  When  traffic  incident  happens,  allow  inspection  of 
traffic. 

□  Requirement  3:  design  scalable  architecture  to 
accommodate  large  ISP  traffic  volume 
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Goal 


■  Explore  heuristic  method  of  designing  collection 
system  for  introduction  into  actual  network. 

■  Proposed  collection  system  needs  to  accommodate 
following  network  model. 

□  Total traffic volume: 500 Gb/s, 100 Mp/s

■  Edge routers: 20/PoP × 10 PoPs = 200

■  NetFlow is enabled on the ingress IF of each edge router.
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Hierarchical  Collection  System 

■  Mediators  are  allocated  in  each  PoP. 


□  They  store  all  flow  records,  aggregate  them, 
and  export  them  to  next  collector. 


Top  Collector 

□  measures  wide-area 
traffic  matrices,  such  as 
router  by  router,  pop  by 
pop. 

Inspection 

□  If  traffic  incident 
happens,  we  can  retrieve 
detailed  flow  records 
from  Flow  Mediator. 


Requirement  1 


Requirement  2 


10  PoPs,  20  routers/PoP,  Mediators  are  located  in  each  PoP. 

[Figure legend: Core Edge, Edge, • Observation Point]
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Visualize  Traffic  Matrices 


■  The top collector can visualize router, PoP, and AS traffic matrices.

Nail is the name of our traffic matrix visualizer.

Color indicates the traffic volume of each source/destination pair.

We can select all traffic or a specific VPN (customer).

[Figure: screenshot of the visualizer's "popmatrix (mpls)" view for 2006/10/13 10:15, with selectors for router, IP version (IPv4+v6), and VPN, showing a source PoP × destination PoP matrix of traffic volumes in bps.]
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Heuristic  Design  Method 

■  Suitable  values  of  several  parameters  are  decided 
by  the  following  steps. 

□  Step  0:  measure  performance  limit  of  flow  mediator  and 
top  collector. 

□  Step  1:  reveal  relation  between  number  of  flow  records 
and  packet  sampling 

□  Step  2:  reveal  relation  between  number  of  flow  records 
and  aggregation  that  depends  on  several  factors. 

■  Aggregation  methods  (BGP  Next-Hop,  Prefix,  host) 

■  Aggregation  interval  time  (20  s,  60  s,  90  s...) 

□  Step  3:  select  suitable  value  within  performance  limit. 

■  Large  sampling  rate  is  preferable. 

■  Small  granularity  of  aggregation  is  preferable. 
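Step 3 amounts to a constrained choice, which the sketch below illustrates. The candidate rates are the DST_Prefix, 60-second figures reported later in this talk, and the limits are the 5 kf/s (top collector) and 10 kf/s (mediator) values given under the design considerations; the function name and data layout are hypothetical.

```python
from fractions import Fraction

TOP_COLLECTOR_LIMIT_KFS = 5.0   # performance limit of the top collector
MEDIATOR_LIMIT_KFS = 10.0       # performance limit of each mediator

# (sampling rate, kf/s received per mediator, kf/s received at the top collector)
CANDIDATES = [
    (Fraction(1, 100), 30.0, 21.0),
    (Fraction(1, 1000), 4.4, 4.7),
    (Fraction(1, 10000), 0.6, 0.94),
]


def pick_sampling_rate(candidates, mediator_limit, top_limit):
    """Return the feasible candidate with the largest sampling rate, or None."""
    feasible = [c for c in candidates
                if c[1] <= mediator_limit and c[2] <= top_limit]
    if not feasible:
        return None
    # A large sampling rate is preferable, so take the maximum.
    return max(feasible, key=lambda c: c[0])


best = pick_sampling_rate(CANDIDATES, MEDIATOR_LIMIT_KFS, TOP_COLLECTOR_LIMIT_KFS)
print(best[0])  # sampling rate of the selected configuration
```

The same filter could be extended with aggregation granularity as a second preference once several aggregation methods fit within the limits.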
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Consideration  Points 


■  Design considerations:

□  The maximum performance of the top collector is 5 kf/s, and that of each mediator is 10 kf/s (kf/s = thousand flow records per second).

10 PoPs, 20 routers/PoP; a mediator is located in each PoP.

[Figure legend: Core Edge, Edge, • Observation Point]
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Step  1:  estimate  flow  records  after  sampling 


■  Estimate the number of flow records based on the density function of packets per flow.

□  # of packets per flow: x

□  Packets-per-flow density function: F(x)

□  Sampling rate: 1/r

□  Total number of unsampled flows: f_{all}

f_{sampled} = \sum_{x \geq 1} \left(1 - (1 - 1/r)^{x}\right) \times F(x) \times f_{all}

F(x) = 0.5 \times x^{-1.73}

[Figure: log-log plot of the extraction probability F(x) = 0.5 × x^-1.73 (1.000000 down to 0.000001) against # of packets per flow (1 to 100000).]

Roughly estimate f_{all} as follows: 100 Mpps ÷ 20 packets = 5 Mf/s

Approximate # of flows when total traffic volume is 500 Gb/s:

Sampling rate   1/100      1/1000    1/10000
f_sampled       305 kf/s   43 kf/s   5.2 kf/s
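The estimate can be evaluated numerically with a short loop. The sketch below assumes the sum runs from x = 1 up to a cut-off `x_max` (set here to 100000, the largest value on the plot's axis; the slide does not state the upper limit), and it uses the slide's F(x) = 0.5 × x^-1.73 and f_all = 5 Mf/s. The function and parameter names are illustrative.

```python
def sampled_flows(f_all, r, x_max=100_000, a=0.5, k=1.73):
    """Estimate flows per second that survive 1-in-r packet sampling.

    A flow of x packets is seen if at least one packet is sampled,
    which happens with probability 1 - (1 - 1/r)**x.
    """
    miss = 1.0 - 1.0 / r          # probability that a single packet is not sampled
    total = 0.0
    for x in range(1, x_max + 1):
        p_seen = 1.0 - miss ** x
        total += p_seen * a * x ** (-k)   # weight by density F(x) = a * x^-k
    return total * f_all


F_ALL = 5e6  # 100 Mpps / 20 packets per flow, as on the slide
estimates = {r: sampled_flows(F_ALL, r) for r in (100, 1000, 10000)}
for r, f in estimates.items():
    print(f"1/{r}: {f / 1e3:.1f} kf/s")
```

Because the result depends on the cut-off and on how well the power law fits real traffic, the printed values should be read as rough orders of magnitude to compare against the table above.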
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Too  many  flow  records  without  mediator 


■  Even  if  sampling  rate  is  1/10,000  packets,  the 
number  of  flow  records  exceeds  performance  limit. 


Sampling rate   1/100      1/1000    1/10000
f_sampled       305 kf/s   43 kf/s   5.2 kf/s

Top collector limit: max. 5 kf/s

[Figure: collection system with sampling rate = 1/10000 packets. 10 PoPs, 20 routers/PoP; mediators are located in each PoP. Legend: Core Edge, Edge, • Observation Point]
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Aggregation  Ratio  (=fe/fr) 


Step  2:  flow  records  after  aggregation 


■  What  is  the  #  of  flow  records  after  aggregation? 

■  Mediator  aggregates  unsampled  flow  records  at  20-second 
interval. 

□  Aggregation  efficiency:  Prefix  >  HOST  >  Pair  Prefix  >  Pair  HOST  > 
Bi-Flow 

■  The prefix length "/24" is uniformly applied to Prefix Aggregation.

■  Bi-flow is aggregated from the two flow directions.
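As an illustration of what one aggregation interval does, the sketch below merges flow records by destination /24 prefix and reports the aggregation ratio fe/fr (records exported over records received). The record layout (src, dst, packets, octets) and the function name are assumptions for this example, not the mediator's actual data model.

```python
import ipaddress
from collections import defaultdict


def aggregate_by_dst_prefix(flows, prefix_len=24):
    """Merge flow records that share the same destination prefix."""
    merged = defaultdict(lambda: [0, 0])   # prefix -> [packets, octets]
    for _src, dst, packets, octets in flows:
        # strict=False lets a host address be mapped to its enclosing prefix.
        prefix = ipaddress.ip_network(f"{dst}/{prefix_len}", strict=False)
        merged[str(prefix)][0] += packets
        merged[str(prefix)][1] += octets
    return {prefix: tuple(counts) for prefix, counts in merged.items()}


# Hypothetical flow records received during one aggregation interval.
flows = [
    ("10.0.0.1", "192.0.2.10", 3, 1800),
    ("10.0.0.2", "192.0.2.99", 1, 60),
    ("10.0.0.3", "198.51.100.5", 7, 4200),
    ("10.0.0.1", "198.51.100.5", 2, 120),
]
exported = aggregate_by_dst_prefix(flows)
ratio = len(exported) / len(flows)   # aggregation ratio = fe / fr
print(exported, ratio)
```

A lower ratio means stronger reduction; host-level aggregation would use `prefix_len=32` with the same code.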


[Figure: aggregation ratio (= fe/fr, 0 to 1) against elapsed time (0 to 900 s). Series: PAIR_HOST, SRC_HOST, DST_HOST, PAIR_PREFIX, SRC_PREFIX, DST_PREFIX, BIFLOW.]
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Step  2:  Flow  records  after  aggregation,  sampling 


■  Each  aggregation 
method  becomes 
ineffective  gradually. 

■  Bi-flow  becomes 
ineffective 
immediately. 

□  sensitive  to 
sampling  rate. 


[Figure: two panels, sampling rate 1/128 and sampling rate 1/1024, each plotting aggregation ratio (0 to 1) against elapsed time (0 to 900 s). Series: PAIR_HOST, IP_SRC_ADDR, IP_DST_ADDR, PAIR_PREFIX, SRC_PREFIX, DST_PREFIX, BIFLOW.]
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Step  2:  Which  factor  influences  aggregation? 

■  Aggregation  ratio  depends  on  several  factors. 

□Traffic  Volume  through  observation  point. 

□  Sampling  rate 
□Aggregation  interval  time 

I  guess  that  the  aggregation  ratio  depends  on  the 
number  of  flow  records  received  in  interval  time. 


Received Flows                  3450   3562
Aggregation Interval Time (s)   10     300
Sampling rate (1/r)             1      128
DST_HOST Aggregation ratio      45%    43%
DST_PREFIX Aggregation ratio    30%    32%
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Step  2:  Which  factor  influences  aggregation? 


■  I  plotted  all  experimental  data  into  one  graph. 

□  Three  MAWI  traffic  data  samples  have  different  volumes. 

□  Aggregation interval time: 5 - 300 s

□  Sampling rate: 1/1 - 1/1024

[Figure: aggregation ratio (0 to 1) against # of flow records (log scale, 10 to 1000000). Series: PAIR_HOST, SRC_HOST, DST_HOST, PAIR_PREFIX, SRC_PREFIX, DST_PREFIX, BIPAIR.]

Aggregation ratio depends on the number of received flow records.
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Step  2:  Formulation  of  Aggregation  Ratio 


■  Aggregation  ratio  (R)  can  be  estimated  from 
number  of  flow  records  (fr),  as  follows. 

□  DST Host aggregation: R_{dsthost} = 1.80 \times f_r^{-0.18}

□  DST Prefix aggregation: R_{dstprefix} = 2.34 \times f_r^{-0.26}

■  After  all,  the  aggregation  ratio  depends  on  the  # 
of  unique  hosts  or  prefixes  versus  #  of  flows. 
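The two fitted power laws are easy to evaluate directly. The sketch below uses the coefficients from the slide (1.80 and -0.18 for DST host, 2.34 and -0.26 for DST prefix) and takes f_r to be the number of flow records received within one aggregation interval, which is how the later tables in this talk apply them; the function names are illustrative.

```python
def dst_host_ratio(f_r):
    """Estimated aggregation ratio for DST host aggregation."""
    return 1.80 * f_r ** -0.18


def dst_prefix_ratio(f_r):
    """Estimated aggregation ratio for DST prefix aggregation."""
    return 2.34 * f_r ** -0.26


# Flow records received per interval, taken from the 1-minute table in this talk.
for f_r in (918_000, 132_000, 18_000):
    print(f_r, round(dst_host_ratio(f_r), 3), round(dst_prefix_ratio(f_r), 3))
```

For 918 k received flows this gives roughly 15% for host aggregation and roughly 7% for prefix aggregation, in line with the ratios tabulated in the talk; the more flows arrive in an interval, the more of them share a destination and the lower the ratio becomes.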


[Figure: log-log sketches of # of DST hosts against # of flow records, and of the aggregation ratio (= DST hosts / flows) against # of flow records.]
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Step  3:  Selection  of  Suitable  Values 


■  I  selected  suitable  value  within  performance  limit. 


Sampling rate                                      1/100     1/1000     1/10000

# of received flow records in mediator (fr)        30 kf/s   4.4 kf/s   0.6 kf/s

# of received flow records in top collector (= fe):
DST_HOST aggregation, interval time = 60 s         45 kf/s   9.0 kf/s   1.6 kf/s
DST_Prefix aggregation, interval time = 60 s       21 kf/s   4.7 kf/s   0.94 kf/s
DST_HOST aggregation, interval time = 300 s        34 kf/s   7.0 kf/s   1.2 kf/s
DST_Prefix aggregation, interval time = 300 s      12 kf/s   3.0 kf/s   0.62 kf/s
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Example  of  collection  system 


■  Sampling  Rate:  1/1000 

■  Aggregation  Interval  time:  60  s 


Jan 8, 2008

[Figure: collection system feeding the Traffic Matrix View at the top collector (max. 5 kf/s). 10 PoPs, 20 routers/PoP; mediators are located in each PoP. Legend: Core Edge, Edge, NetFlow observation point.]
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Conclusion 


■  A flow mediator is an efficient way to build a large-scale traffic collection system.

■  Revealed the relation between the number of flow records and several factors:

□  Traffic volume
□  Sampling rate
□  Aggregation method
□  Aggregation interval time

■  Demonstrated  that  traffic  collection 
system  using  mediator  can  be  introduced 
into  actual  large-scale  networks. 
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Thank  you  for  your  attention. 


This study was supported by the Ministry of Internal Affairs and Communications of Japan.
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Keisuke  Ishibashi 
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Outline 


■  Motivation 

□Approach  to  the  scalability  in  Large  NW 

□  What  is  Flow  Mediators? 

□  Introduce  Hierarchical  model 

■  Design  method  of  Collection  system 

□  Estimation  received  Flows  after  sampling 

□  Estimation  received  Flows  after  aggregation 

□  Results  of  reference  model 

■  Conclusion 
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Motivation 

■  We want both to measure wide-area traffic matrices and to perform in-depth inspection.

□  A single collector cannot meet both requirements.

■  In particular, it is difficult to keep the collection system scalable in a large network.

□  The number of exported Flows becomes huge.

■  100 Gb/s of traffic creates approximately 50 kf/s with 1/1000 sampling.

■  Lowering the sampling rate could make small Flows invisible.

■  Approach scalability by using Flow Mediators

□  to make Flow collection scalable, efficient, and useful.

Please refer to our draft, draft-kobayashi-ipfix-large-ps-00.txt
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What  is  Flow  Mediator? 


■  A Flow Mediator is a system that "mediates" Flow Records and has the following functions:

□  collects Flow Records from various exporters

□  stores original Flow Records

□  aggregates Flow Records flexibly

□  distributes appropriate Flow Records to dedicated collectors/analyzers

To reduce the number of Flows, we focus on the aggregation function.

[Figure labels: NW designer, Customer Service, NW Operator; Traffic Matrix measurement, Accounting System, Troubleshoot System]

Please refer to our draft, draft-kobayashi-ipfix-mediator-model-01.txt
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You  can  feel  Flow  Mediation 


■  The Net::Flow Perl module is available on CPAN.

□  http://search.cpan.org/~akoba/Net-Flow-0.02/

□  The module can decode and encode NetFlow/IPFIX packets.

□  The decoding and encoding functions have similar interfaces.


Approach  to  Hierarchical  Model 


Mediators

□  store all Flows, aggregate them, and export them to the next collector.

Top Collector

□  measures wide-area Traffic Matrices.

If a traffic incident happens, we can retrieve the detailed Flow data from the Mediator.

[Figure: screenshot of the traffic matrix visualizer's "popmatrix (mpls)" view for 2006/10/13 10:15, showing a source PoP × destination PoP matrix of traffic volumes in bps. We can select all traffic or a specific VPN (customer).]
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[^3  Core  Edge 

Edge 

NetFlow 
•  observation 
Point 


Wide-area  Traffic  Matrices 


■  The top collector can visualize the router, PoP, and AS traffic matrices.

We can select all traffic or a specific VPN (customer).

The color indicates the traffic volume of the source/destination pair.

[Figure: screenshot of the traffic matrix visualizer's "popmatrix (mpls)" view for 2006/10/13 10:15, with selectors for router, IP version (IPv4+v6), and VPN, showing a source PoP × destination PoP matrix of traffic volumes in bps.]
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Approach  to  Design  for  Hierarchical  Model 


■  To introduce the model into a real network, we explored a design method using Mediators.

□  To design the model, we need a rough estimate of the # of Flows.

■  Estimate the # of Flows exported from a router under packet sampling.

□  How many Flows are removed by packet sampling?

■  Estimate the effect of aggregation.

□  Aggregation methods (BGP Next-Hop, Prefix, host)

□  Aggregation interval time (20 s, 60 s, 90 s...)

We estimated the # of received Flows and aggregated Flows based on MAWI traffic observed at an international gateway.
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Considerations  for  Design 


■  Design considerations:

□  The maximum performance of the Top Collector is 5 kf/s, and that of each Mediator is 10 kf/s.

■  Explore the aggregation granularity and sampling rate that stay within each performance limit.

Jan 7, 2008

[Figure: 10 PoPs, 20 routers/PoP; Mediators are located in each PoP. Legend: NetFlow observation point.]


Network  Scale  Model 


■  Total traffic volume: 500 Gb/s, 100 Mp/s

□  Edge routers: 20/PoP × 10 PoPs = 200

□  Core edge routers: 2/PoP × 10 PoPs = 20

□  NetFlow is enabled on the ingress IF of each edge router.

[Figure: network model with core routers, core edge routers, and edge routers, marking the NetFlow observation points.]
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Estimate  Flows  after  Packet  Sampling 


■  Estimate the # of flows exported from a router according to the density function of packets per flow.

□  # of packets per flow: x

□  Packets-per-flow density function: F(x)

□  Sampling rate: 1/r

□  Total number of unsampled flows: f_{all}

f_{sampled} = \sum_{x \geq 1} \left(1 - (1 - 1/r)^{x}\right) \times F(x) \times f_{all}

F(x) = 0.5 \times x^{-1.73}

[Figure: log-log plot of the extraction probability F(x) = 0.5 × x^-1.73 (1.000000 down to 0.000001) against # of packets per flow (1 to 100000).]

Roughly estimate f_{all} as follows: 100 Mpps ÷ 20 packets = 5 Mf/s

Approximate # of flows when total traffic volume is 500 Gb/s:

Sampling rate   1/100      1/1000    1/10000
f_sampled       305 kf/s   43 kf/s   5.2 kf/s
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Aggregation  Effect  in  non-sampled  NetFlow 


■  Aggregate non-sampled flows at 20-second intervals.

□  The prefix length "/24" is uniformly applied to Prefix Aggregation.

□  Bi-Flow is aggregated from the flows in both directions.

□  Aggregation effect: Prefix > HOST > Pair Prefix > Pair HOST > Bi-Flow

[Figure: aggregation ratio (0 to 1) against elapsed time (0 to 900 s). Series: PAIR_HOST, SRC_HOST, DST_HOST, PAIR_PREFIX, SRC_PREFIX, DST_PREFIX, BIFLOW.]
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Aggregation  Effect  in  sampled  NetFlow 


■  Each aggregation method gradually becomes ineffective.

■  Bi-Flow becomes ineffective immediately.

□  It is sensitive to the sampling rate.

[Figure: aggregation ratio (0 to 1) against elapsed time (s) for sampled NetFlow; one panel is labeled sampling rate 1/1024. Series: PAIR_HOST, IP_SRC_ADDR, IP_DST_ADDR, PAIR_PREFIX, SRC_PREFIX, DST_PREFIX, BIFLOW.]
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Which  factor  influences  aggregation? 


■  The aggregation effect depends on several factors.

□  Traffic volume through the observation point

□  Sampling rate

□  Aggregation interval time

But, put roughly and simply, it depends on the # of flows received within the aggregation interval time.

Received Flows                  3450   3562
Aggregation Interval Time (s)   10     300
Sampling rate (1/r)             1      128
DST_HOST Aggregation ratio      45%    43%
DST_PREFIX Aggregation ratio    30%    32%
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Aggregation  Effect  against  #  of  Received  Flows 


■  All the traffic data are plotted in one graph.

□  The traffic data come from 3 samples with different volumes.

□  Aggregation interval time: 5 s - 300 s

□  Sampling rate: 1/1 - 1/1024

[Figure: aggregation ratio against # of received flows (log scale, 10 to 1000000). Series: PAIR_HOST, SRC_HOST, DST_HOST, PAIR_PREFIX, SRC_PREFIX, DST_PREFIX, BIFLOW, BIPAIR.]
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Formulation  of  Aggregation  Effects 


■  The aggregation ratio (R) can be estimated from the # of received flows (f_{received}), as follows.

□  DST Host aggregation: R_{dsthost} = 1.80 \times f_{received}^{-0.18}

□  DST Prefix aggregation: R_{dstprefix} = 2.34 \times f_{received}^{-0.26}

■  After all, it depends on the # of unique hosts or prefixes relative to the # of flows.

[Figure: log-scale sketches of DST hosts against flows, and of the aggregation ratio (= DST hosts / flows) against flows.]
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Flow  rate  after  aggregationQinterval=lmQ 


Interval: 1 m

Sampling rate                           1/100                   1/1000                 1/10000
f_sampled                               305 kf/s                43 kf/s                5.2 kf/s
Received Flow rate per Mediator         30 kf/s                 4.4 kf/s               0.6 kf/s
Received Flows in interval time (1 m)   918 kflows              132 kflows             18 kflows

DST_HOST aggregation, R = 1.80 \times f_{total}^{-0.18}
ratio                                   15%                     21%                    31%
flow rate after aggregation             305 × 0.15 = 45 kf/s    43 × 0.21 = 9.0 kf/s   5.2 × 0.31 = 1.6 kf/s

DST_Prefix aggregation, R = 2.34 \times f_{total}^{-0.26}
ratio                                   7%                      11%                    18%
flow rate after aggregation             305 × 0.07 = 21 kf/s    43 × 0.11 = 4.7 kf/s   5.2 × 0.18 = 0.94 kf/s
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Flow  rate  after  aggregationQinterval=5mQ 


Interval: 5 m

Sampling rate                           1/100                   1/1000                 1/10000
f_sampled                               305 kf/s                43 kf/s                5.2 kf/s
Received Flow rate per Mediator         30 kf/s                 4.4 kf/s               0.6 kf/s
Received Flows in interval time (5 m)   4.6 Mflows              660 kflows             90 kflows

DST_HOST aggregation, R = 1.80 \times f_{total}^{-0.18}
ratio                                   11%                     16%                    23%
flow rate after aggregation             305 × 0.11 = 34 kf/s    43 × 0.16 = 7.0 kf/s   5.2 × 0.23 = 1.2 kf/s

DST_Prefix aggregation, R = 2.34 \times f_{total}^{-0.26}
ratio                                   4%                      7%                     12%
flow rate after aggregation             305 × 0.04 = 12 kf/s    43 × 0.07 = 3.0 kf/s   5.2 × 0.12 = 0.62 kf/s
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Design  of  Collection  System 


■  Sampling  Rate  :  1/1000 

■  Aggregation  Interval  time  :  60  s 
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□Maximum  limit:  5  kf/s 


[Figure: 10 PoPs, 20 routers/PoP; Mediators are located in each PoP. Legend: Core Edge, Edge, NetFlow observation point.]
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Conclusion 


■  Flow Mediation can be introduced into your network easily.

■  Flow Mediation is an efficient way to stay scalable as traffic grows.

■  Flow Mediators let operators control aggregation and sampling parameters, so the flexibility of both can be exploited further.

■  We can design a hierarchical collection system for a large-scale network.
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Thank  you  for  your  attention. 
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On  Terabit  Flow  Analysis 

FloCon  2008,  Savannah 


Jonathan  M.  Smith 
CIS Department, U. Penn


Terabit  Network  Applications 

•  Full-fidelity  remote  visualization  and  interactive 
simulation  for  80fps  HD  /  3D  HD  and  beyond, 
support  for  holographic  visualization 

•  High-speed  sensor  data  from  science  experiments 

•  Immersive  simulations  and  high-fidelity  massively 
multiplayer  virtual  worlds 

•  Receive  and  analyze  many  concurrent  high-fidelity 
streams  of  video  and/or  sensor  data  -  multiple 
uses  in  public  safety,  financial  services  and  other 
domains 


Challenges  for  Flow  Analysis? 

•  New  kinds  of  traffic: 

-  Extremely  High  Data  Rates 

-  Long  flows 

-  New  patterns  with  P2P  and  sensors 

•  Correlation  -  obtaining  the  “high  ground” 

-  Rare  events  vs.  attenuated  sampling? 

•  New  analysis  possible  with  DPI 

•  Goal:  ingest,  record  and  analyze  it  all! 


Tradespace:  data  rates  vs.  analysis 


[Figure: tradespace chart. Traffic sources range from DSL/3G wireless through consumer FiOS and Ethernet. As data rates rise, the number of instructions available per byte/sec of throughput decreases. The “high ground” combines high aggregation with high data processing rates and an increasing ability to relate / correlate in real time.]
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The  Terabit  Chokepoint 

Problem/Challenge: Network chokepoint (I/O and memory) between fibers and CPUs

[Figure: data path from fiber-optic WAN to CPU. Vertical axis: data rates, log10 bits per second. Labeled stages: WDM fiber (aggregate), SONET 5-year “stretch” goal (OC768), file (aggregate), L1 cache, L2 cache, register, DRAM. Annotations: next design target; upper bound on network application throughput.]
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Today’s  Single-Core  PC 
Performance  Measurements 


(Using Ubuntu Linux “MEMTEST” utility)

[Figure: data path from fiber-optic WAN to CPU, with measured rates added. Labeled stages: WDM fiber (aggregate), L1 cache, L2 cache, register, DRAM. Measurements: L2 cache: 100 Gb/s; DRAM: 16 Gb/s. Annotations: next design target; upper bound on network application throughput.]
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Challenge  of  Dense  Wavelength 
Division  Multiplexing  (DWDM) 

•  Fiber  bandwidth  is  serial  bit  rate 
multiplied  by  number  of  wavelengths 

•  E.g.,  128*40Gbps  in  deployed  systems 
(128  lambdas  of  OC768c  SONET) 
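The bandwidth rule on this slide is a single multiplication. The sketch below works it through for the deployed-system example and, as an illustration only, compares the result with the 16 Gb/s single-core DRAM measurement quoted earlier in the talk. The function name is hypothetical, and the 40 Gbps figure is the slide's rounded OC768 rate.

```python
def fiber_bandwidth_bps(wavelengths, serial_rate_bps):
    """Aggregate fiber bandwidth = serial bit rate x number of wavelengths."""
    return wavelengths * serial_rate_bps

LAMBDAS = 128                       # wavelengths in the slide's example
OC768_RATE_BPS = 40 * 10**9         # 40 Gbps per wavelength, as stated on the slide
DRAM_RATE_BPS = 16 * 10**9          # single-core DRAM measurement from the earlier slide

total = fiber_bandwidth_bps(LAMBDAS, OC768_RATE_BPS)
print(f"aggregate: {total / 10**12:.2f} Tbps")
print(f"DRAM-rate paths needed to keep up: {total // DRAM_RATE_BPS}")
```

The second figure is only a rough sizing argument for why parallel processing comes up on the next slide.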


Processing Must Scale with Fiber Capacity

•  Parallel processing seems necessary

•  Memory/processing elements to track line rates and number of wavelengths?
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Many-Core  CPU/GPU  Future 

•  Parallelism floodgate unleashed

-  GPUs and CPUs converging

•  Teraflop+ performance in 2009

-  E.g., 32 cores @ 2 GHz

-  16-element “short” vectors

-  100 terabit/sec aggregate register bandwidth

-  1 terabit/sec GDDR3 memory bandwidth

•  How do we feed it?

[Figure: 80-core Intel test chip]


Technical  Approach 


[Figure: serial bitstreams extracted by an optical add-drop multiplexer (OADM)]

•  Constraints: pins, power, cost

•  Switch-based interconnects, parallel paths

-  Direct network/processor interface?

•  Stream/graphics engines, banked memories

-  Special high-end pool of DRAMs for NICs?

•  New software structures for multicore
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Components  looking  good  - 
architecture  needed 


•  1  TB  (8  Tbps)  memory  technologies 
announced.  Fiber  good  to  >10Tb/sec 

•  80-1000  cores  @1-10  Gbps  each 

•  Major  challenges:  fiber/electronic 
boundary,  data  distribution, 
interconnection  network  architectures 
(see,  e.g.,  Dally+Towles) 


Even  more  processing  to 
scale  with  fiber  capacity? 

•  Parallel  processing  at  both  multicore 
(perhaps  NPUs?)  and  “box”  level 

•  Cores  track  line  rates,  while  degree  of 
“box”  parallelism  matched  against 
grosser  units  of  wavelengths,  e.g.,  8: 


Advanced  Broadband  Intrusion 
Detection  Engine  (ABIDE) 


[Figure: ABIDE architecture diagram; labels not recoverable]
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Help  architects  to  help  you 


•  Computer  architects  (see  Proc.  ISCA, 
Micro,  ASPLOS,  HPCA,  ...)  evaluate 
proposals  with  benchmarks 

•  Media  benchmarks  are  being  developed 

http://euler.slu.edu/~fritts/mediabench/


•  Flow  analysis  needs  benchmarks  for 
flow  analysis  tasks  -  input  side,  not  just 
netflow  outputs  (this  is  after  the  fact) 


Summary 

•  The  future  is  in  parallelism 

-  Dense  Wavelength  Division  Multiplexing  (DWDM) 

-  On-chip  networks  for  multicore 

-  Trees  for  “box”-scale  parallelism 

•  Huge  challenges  remain 

-  Software  for  new  parallelism  /  media  stream 
analysis;  topological  choices  (e.g.,  Batcher-Banyan 
+  Crossbar?);  load-balancing  algorithms 

•  Need  to  get  flow  analysis  workloads  on 
computer  architecture  radar 
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Abstract  of  this  presentation 

Ideas for increasing (optimizing) the performance of IPFIX processes

All of the ideas assume that every process uses a common order rule for Information Elements (IEs)/fields.

The following ideas are introduced:

■  A method for reducing the number of comparisons between existing flows and an incoming packet in Metering Processes (MPs)
(comparison method for multiple fields in MPs)

■  A method for reducing the number of copy operations when flow records pass from Metering Processes to Exporting Processes (EPs) with a predefined order of fields
(copy method for multiple fields in EPs)

■  A method for speeding up how Collecting Processes (CPs) store data from incoming packets to a file with a predefined format
(copy method for multiple fields in CPs)

These methods are basically the same technique.
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Motivation  of  this  research 
■  Background

□  Network bandwidth will continue to increase.

□  IPFIX will be a standard protocol for flow information exchange.

■  As network bandwidth grows, an exporter can keep up in two ways:

□  Use a lower sampling rate.

□  Use fewer Flow Keys.

However, flow information will become less accurate.

Hence: research on increasing (optimizing) the performance of IPFIX processes.
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IPFIX  features 

■  IPFIX

□  Advantage: uses Template-based, flexible flow export

□  Disadvantage: more complex than a fixed-format protocol

Comparison of processes between fixed and flexible formats
(data path: observed packets → Metering Process → Exporting Process → Data Records → Collecting Process → storage)

NetFlow v5 (fixed format)

□  MP reorders fields of the observed packet into a Flow Record arranged in NFv5 format.

□  EP inserts the NFv5 header and sends the record.

□  CP can send Flow Records in normalized format to storage by removing the NFv5 header.

IPFIX (flexible format)

□  MP reorders fields of the observed packet into a Flow Record of internal format in its cache. The format depends on the implementation of each Exporter.

□  EP reorders fields based on the configured Templates and sends the record.

□  If CP sends Flow Records in normalized format to storage, it must reorder the fields of incoming packets.
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Our  approach:  Making  the  order  rule  for  Information  Elements 


■  IPFIX processes are very likely to have to reorder fields.

□  Reducing the cost of reordering fields can improve their performance.

■  Our approach

□  Define an order rule for Information Elements.

■  An order rule gives IPFIX processes the chance to process multiple fields at once.

■  Processing multiple fields at a time achieves higher performance than processing one field at a time.

■  The rule does not affect the flexibility of IPFIX.

If a unified order rule for fields/IEs is defined, reordering costs can be reduced.
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Idea  of  order 
■  Idea of order:

□  MPs, EPs and CPs place fields (IEs) in the same order, so it is highly likely that multiple fields can be processed at a time.

■  This reduces reordering costs.

■  Order recommended in this presentation

□  Place fields from observed packets in the order of the protocol header.

□  Therefore, an order of IEs that follows the packet and header fields is recommended.

Process               Input                                     Output
Metering Processes    Observed packets (network byte order)     Their caches (storing)
Exporting Processes   Their caches                              IPFIX Data Record (network byte order)
Collecting Processes  IPFIX Data Record (network byte order)    Files, their DB (storing); real-time analysis
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Example  of  using  same  order  in  MP,  EP  and  CP 

Flow Keys: sourceIPv4Address, destinationIPv4Address, sourceTransportPort, destinationTransportPort

Good (ideal) case: the cache in the Exporter and the IPFIX Data Records use the same suggested order, which follows the order of the packet header fields.

[Figure: an IPv4/UDP header (Version, IHL, TOS, Total Length, ID, Flags, Fragment Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address, Source Port, Destination Port, UDP Length, UDP Checksum) mapped to a cache record and to a Data Record, both laid out as sourceIPv4Address, destinationIPv4Address, sourceTransportPort, destinationTransportPort.]

Bad case: the cache in the Exporter and the IPFIX Data Records use different orders.

[Figure: the same four fields stored in one order in the cache and in another order in the Data Record, so each field has to be moved individually.]

■  If a referential order that follows the order of packet fields is defined, it can lead to increased performance in some cases.

■  If no referential order is defined, there is no possibility of increased performance.


1st  idea  to  improve  performance 
in  environment  in  which  MP,  EP,  and  CP  use  the  same  order 


Comparison method for multiple fields in Metering Processes (MPs)

NTT Network Service [...] Laboratories, NTT Corporation


Comparison  method  for  multiple  fields  in  MP  (1) 


■  An MP must repeatedly compare the existing Flow Records in its cache with each newly observed packet

□  to judge whether the new packet belongs to a new flow or an existing one.

■  Basically, in this comparison, all fields (IEs) serving as Flow Keys are compared every time.

■  If the fields of Flow Records are placed in the same order as the packet header fields, the MP can compare multiple fields at a time.

[Figure: Metering Process. The MP repeats comparisons and finds a flow.]
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Comparison  method  for  multiple  fields  in  MP  (2) 
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Example:  Flow  Key:  Version,  IHL,  TOS,  source  Address,  destination  Address 
All fields are compared every time (general approach)

[Figure: the IPv4 header of an observed packet compared, field by field, with a Flow Record in cache.]

When a packet arrives: 5 comparisons

1.  IP version

2.  IHL

3.  TOS

4.  Source Address

5.  Destination Address

Multiple field comparison (our approach)

Premise: Fields of Flow Records are placed in the same order as the packet header fields.

When the Template is defined: create a mask

[Figure: mask created when the Template is defined. Bits covering Version, IHL, TOS, Source IPv4 Address and Destination IPv4 Address are set to 1; all other header bits are 0.]

When a packet arrives: mask the packet, and compare these memory areas at the same time (e.g., memcmp in C language)

[Figure: the masked observed packet compared with a Flow Record in cache. Non-key fields are zero in both.]

Or, on a 32-bit architecture:

1.  Version + IHL + TOS

2.  Source Address

3.  Destination Address
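The mask-then-compare idea can be sketched in a few lines. The slide names memcmp in C; the version below uses Python bytes comparison as a stand-in, so it illustrates the logic rather than the performance. The byte offsets assume a 20-byte IPv4 header without options, and `make_header` is a hypothetical helper used only to build demo input.

```python
import struct

IPV4_HEADER_LEN = 20  # bytes, header without options (assumption)

def build_mask(key_byte_ranges, length=IPV4_HEADER_LEN):
    """Done once, when the Template is defined: 0xff over Flow Key bytes, 0x00 elsewhere."""
    mask = bytearray(length)
    for start, end in key_byte_ranges:
        mask[start:end] = b"\xff" * (end - start)
    return bytes(mask)

def apply_mask(header, mask):
    """Done once per packet: zero every byte that is not part of a Flow Key."""
    return bytes(h & m for h, m in zip(header, mask))

# Flow Keys of the slide's example: Version + IHL (byte 0), TOS (byte 1),
# source address (bytes 12-15), destination address (bytes 16-19).
MASK = build_mask([(0, 2), (12, 20)])

def find_flow(header, cache):
    """Compare the masked header with each cached (already masked) record.
    Each == on bytes is one block comparison, playing the role of memcmp."""
    masked = apply_mask(header[:IPV4_HEADER_LEN], MASK)
    for index, record in enumerate(cache):
        if masked == record:
            return index
    return None

def make_header(tos, ttl, src, dst):
    """Hypothetical helper: pack a minimal IPv4 header for the demo."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, tos, 40, 0, 0, ttl, 6, 0, src, dst)

cache = [apply_mask(make_header(0, 64, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])), MASK)]
same_flow = make_header(0, 57, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))   # only TTL differs
other_flow = make_header(0, 64, bytes([10, 0, 0, 9]), bytes([10, 0, 0, 2]))  # source differs
print(find_flow(same_flow, cache), find_flow(other_flow, cache))
```

The mask is built once per Template, the masking runs once per packet, and only the block comparison repeats per cached record, which is the cost split the method relies on.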
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Comparison  method  for  multiple  fields  in  MP  (3) 


■  Number of operations in this method

□  Masking costs less than comparison.

□  Therefore, this method increases performance by reducing the number of comparisons, although it adds mask operations.

Operation       Number of operations
Mask creation   Once in an IPFIX session (when the Template is defined)                       (less)
Mask            Depends on the number of observed packets (when a packet arrives)
Comparison      Depends on the number of observed packets and of flow records in the cache    (more)

■  Effective and ineffective cases

□  Effective case: Flow Keys are placed densely in the header.

□  Ineffective case: Flow Keys are placed sparsely.

[Figure: two IPv4 headers (Version, IHL, TOS, Total Length, ID, Flags, Fragment Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address) with the Flow Key fields highlighted, once densely and once sparsely.]
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2nd  idea  to  improve  performance 
in  environment  in  which  MP,  EP,  and  CP  use  the  same  order 

Copy method for multiple fields in Exporting Processes (EPs) and Collecting Processes (CPs)

Overview of copy method for multiple fields

It is a very simple method.

□  If the fields in the cache format and the IEs in the exported Data Records are placed in the same order, EPs have a chance to copy multiple adjacent constant-fixed-length IEs at a time.

□  If the IEs in received Data Records and the fields in the Collector's internal format for storing Flow Records are placed in the same order, CPs also have a chance to copy multiple adjacent constant-fixed-length IEs at a time.

IE size classification of IPFIX (terminology in this presentation)

Protocol specification                      In this presentation
Fixed-length IE                             Constant-fixed-length IE (e.g., IP address)
Fixed-length IE                             Reduced-size-encoding applicable IE (e.g., counters)
Variable-length IE (octet array, strings)   Variable-length IE (octet array, strings)
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Example  of  copy  method  for  multiple  fields  in  EP 


■  Conditions for copying multiple fields

□  The Flow Record in cache and the exported Data Record must use the same order.

□  IEs must have a constant fixed length.

■  Almost all IEs characterizing properties of a flow have a constant fixed length.

□  Byte orders must be the same.

■  Observed packets and exported Data Records use network byte order.

□  IEs for copying multiple fields must be adjacent.

[Figure: Flow Record in cache mapped to the exported Data Record. The cache holds the characteristic properties of the flow (IPv4/UDP header fields: Version, IHL, TOS, Total Length, ID, Flags, Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address, Source Port, Destination Port, UDP Length, UDP Checksum) and the measured properties of the flow (Flow Start Absolute Time, Flow End Absolute Time, Flow Octet Count). The Data Record holds ipClass(5), protocol(4), sourceIPv4Address(8), destinationIPv4Address(12), srcv4PrefLen, dstv4PrefLen, srcTransportPort(7), dstTransportPort(11), flowStartSysUpTime(22), flowEndSysUpTime(21), octetDeltaCount(1), packetDeltaCount(2). Adjacent constant-fixed-length IEs are copied several fields at a time; the remaining IEs are copied one at a time, with length adjustment, byte-order conversion, time calculation, etc. Legend: constant-fixed-length IE; reduced-size-encoding-applicable IE.]
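The conditions above amount to finding runs of fields that are adjacent in the cache and appear in the same order in the Data Record. The following sketch plans such runs; it assumes every listed field is a constant-fixed-length IE already stored in network byte order, and the layout dictionary and function names are illustrative, not taken from any IPFIX implementation.

```python
def plan_copies(cache_layout, export_order):
    """cache_layout maps field name -> (offset, length) in the cached record.
    Returns (offset, length) runs; each run can be moved with one copy."""
    runs = []
    for name in export_order:
        offset, length = cache_layout[name]
        if runs and runs[-1][0] + runs[-1][1] == offset:
            # Field directly follows the previous one in the cache: extend the run.
            runs[-1] = (runs[-1][0], runs[-1][1] + length)
        else:
            runs.append((offset, length))
    return runs

def export_record(cached_record, runs):
    """Build the Data Record payload with one slice copy per run."""
    return b"".join(cached_record[o:o + n] for o, n in runs)

# Cache laid out in packet-header order (hypothetical offsets).
LAYOUT = {
    "sourceIPv4Address": (0, 4),
    "destinationIPv4Address": (4, 4),
    "sourceTransportPort": (8, 2),
    "destinationTransportPort": (10, 2),
}
SAME_ORDER = ["sourceIPv4Address", "destinationIPv4Address",
              "sourceTransportPort", "destinationTransportPort"]
OTHER_ORDER = ["destinationIPv4Address", "sourceIPv4Address",
               "destinationTransportPort", "sourceTransportPort"]

print(len(plan_copies(LAYOUT, SAME_ORDER)), len(plan_copies(LAYOUT, OTHER_ORDER)))
```

With the shared order the four fields collapse into a single copy; with a different order each field needs its own copy.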
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Evaluation  &  Conclusion 
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This  material  contains  an  evaluation  about  only  comparison  method. 

For an evaluation of the copy method, please see the material I presented at a past IETF meeting: http://www3.ietf.org/proceedings/07jul/slides/ipfix-10.pdf.
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Evaluation  of  comparison  method  for  multiple  fields 


[Figure: bar chart of processing time, single-field vs. multiple-field comparison, for three Flow Key sets: P+SA+DA+SP+DP; TTL+P+SA+DA+SP+DP; V+IHL+TOS+TTL+P+SA+DA+SP+DP. Annotations on the chart read "almost the same", "27% faster", and a third "faster" percentage whose digits are illegible in the source. Below each bar group, an IPv4 header diagram (V, IHL, TOS, Total Length, ID, F, Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address, Source Port, Dst Port) highlights the Flow Key fields in use.]

■  The denser the Flow Key fields, the faster this method works.
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Computing  environment  for  the  evaluation 
■  Software Exporter program

□  runs on Intel Xeon 3.06 GHz HT architecture

□  runs on Linux (Debian GNU/Linux 4.0)

□  compiled with gcc4

■  optimization option: -O3

■  Data used as observed packets:

□  PCAP data published by the WIDE project.

□  contains 6,906,333 packets.

□  ftp://mawi.nezu.wide.ad.jp/pub/mawi/samplepoint-B/20060303/200603030100.dump.gz
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Conclusion 
■  Introduced ideas to improve the performance of IPFIX processes

□  Comparison method for multiple fields in MPs

□  Copy method for multiple fields in EPs and CPs

■  These ideas are based on defining an order rule for IEs/fields.

□  Our recommendation: IEs/fields are placed in the order of the packet header fields.

■  The order rule is published as an individual Internet Draft.

□  http://tools.ietf.org/id/draft-irino-ipfix-ie-order-03.txt

□  If you agree with these ideas, work with us.


Abnormal traffic detection and alert


Yiming Gong
XO Communications

FloCon 2008
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X 


The problem and request


•  XO network

-  OC192 IP backbone with OC12 uplinks in our markets and data centers, AS 2828

•  Backbone level abnormal traffic detection

-  netflow
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eet 


The problem and request


•  Commercial product not good enough

-  You get what the GUI gives you

-  Very likely to miss low-volume traffic attacks

•  (storm worm, scans)

-  By default, alerts are based on thresholds

-  Lacks data mining ability

-  Cost

•  Free flow-based tools

-  Powerful, but you need to tell them what to do
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Q  So  what  we  want 


•  Detect abnormal network traffic

-  both low and high volume

•  Non-threshold based

•  Automatic

•  Fully controlled and customized

•  Data mining

•  Preferably free
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Perfect  world 


•  In  a  perfect  world,  traffic  shape  should  be  very  smooth 


•  Spike  means 
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Q  Detection  at  traffic  level  is  not  good 


•  Granularity is too coarse

•  Real attacks hide behind the huge volume of traffic

•  Not easy to tell what is going on

•  SYN attack? ICMP ping flood?
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Q  Our  thoughts 


•  Netflow  based 

•  Break  down  raw  netflow  records  to 

-  TCP  SYN,  UDP  total,  ICMP  type|code,  protocol  on  each 
IFIndex  of  each  router 

•  Session 

•  Traffic 

•  For  each  element,  establish  a  dynamic  profile 

•  When  there  is  spike,  something  is  going  on 


■ 
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Dynamic  profile 


[Figure: RRD graph "IRLosAngeles TCP SYN traffic on IFindex51", showing failures, the expected value, and SYN traffic on IF51 (current: 124242, average: 58951, max: 169622).]
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Q  Dynamic  profile 


Establishing a profile

-  Using NFDUMP to receive, store and process netflow data

-  rrdtool with the aberrant behavior module

-  rrdtool (http://oss.oetiker.ch/rrdtool/)

-  aberrant behavior module

•  Learns from past values and uses them to predict the future

•  Tolerance band
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Q  Dynamic  profile 


-  Nfdump

yiming> more IR-syn-Amsterdam
13 1864
9 144
21 85


-  Rrdtool

rrdtool create IR-syn-Amsterdam.rrd -s 300 \
    DS:13:GAUGE:1200:0:U \
    DS:9:GAUGE:1200:0:U \
    DS:21:GAUGE:1200:0:U \
    RRA:HWPREDICT:2016:0.001:0.0035:288
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Failure 


Only an entry

•  IR-syn-Amsterdam: [1196800800] RRA[FAILURES][1]DS[13] = 1.0000000000e+00

•  Need a script to do the trace-back work

-  Every 10 minutes, it scans the rrd output for failures

-  Short-lived spikes

•  window-length and failure-threshold

•  rrdtool tune x.rrd --window-length 5 --failure-threshold 3
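The window-length and failure-threshold settings can be read as a sliding-window vote over tolerance-band violations. The sketch below illustrates that idea only; it is not rrdtool's code, and the function name and input format (one 0/1 violation flag per 5-minute sample) are assumptions.

```python
from collections import deque

def flag_failures(violations, window_length=5, failure_threshold=3):
    """Return one boolean per sample: True when at least `failure_threshold`
    of the last `window_length` samples fell outside the tolerance band."""
    window = deque(maxlen=window_length)
    flags = []
    for violated in violations:
        window.append(bool(violated))
        flags.append(sum(window) >= failure_threshold)
    return flags

# A single short-lived spike is ignored; a sustained deviation is flagged.
short_spike = [0, 0, 1, 0, 0, 0]
sustained = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(flag_failures(short_spike))
print(flag_failures(sustained))
```

With window 5 and threshold 3, an isolated violation never raises a failure, which is the reason for tuning these two values against short-lived spikes.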
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Failure 


Tracking down the failure

-  Nfdump + netsnmp + mysql + whois...

-  Narrowing down from the flows to get the suspicious host(s)

yiming> more IR-syn-Amsterdam
13 1864
9 144
21 85


-  Flows of "TCP + SYN bit only + IFindex 13 + router Amsterdam"

-  Finding the most ACTIVE host(s)

•  What is the definition of active?

-  Session number

-  Traffic volume
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Q  Finding  active  host 


Differences between these two pics?

[Figure: two RRD graphs of per-interface sessions, titled IRNewYork and IRVienna, covering Wed 20:00 to Thu 16:00. One graph plots UDP and SYN on interface 50 plus ICMP ping, with current values 46204, 6192 and 1720. The other plots UDP on interface 6 (current: 1013, average: 779, max: 1300), SYN on interface 6 (current: 406, average: 331, max: 653) and ICMP ping (current: 75, average: 65, max: 148).]
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Q  Finding  active  host 


Different criteria

session-icmp*)
    total-number="500";
    flowfilter="proto icmp and port 2048 and if $if";
    trigger-number="280";

session-syn*)
    total-number="2000";
    flowfilter="proto tcp and flags 2 and if $if";
    trigger-number="600";


•  Things we ignored

-  TCP SYN is supposed to be 1, but is 10 now

-  Low volume UDP spikes
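The two numeric filter values above can be decoded as follows. NetFlow reports the ICMP type and code packed into the destination port field as type * 256 + code, so port 2048 corresponds to ICMP echo request (type 8, code 0). The reading of `flags 2` as "TCP flags equal to 0x02, SYN only" is an assumption based on the "SYN bit only" wording earlier in the talk. Function names are illustrative.

```python
TCP_FLAG_SYN = 0x02  # SYN bit in the TCP flags byte

def icmp_as_port(icmp_type, icmp_code):
    """Value NetFlow stores in the destination port field for an ICMP flow."""
    return (icmp_type << 8) | icmp_code

def port_as_icmp(port):
    """Inverse mapping: recover (type, code) from the port field."""
    return port >> 8, port & 0xFF

def is_syn_only(tcp_flags):
    """True when SYN is the only TCP flag set (assumed meaning of 'flags 2')."""
    return tcp_flags == TCP_FLAG_SYN

print(icmp_as_port(8, 0), port_as_icmp(2048), is_syn_only(0x02), is_syn_only(0x12))
```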
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Netflow  records 


Pull  out  necessary  data 

Generate  alert 

-  Picture,  email 



www.xo.com 


CONFIDENTIAL©  2007  XO.  ALL  RIGHTS  RESERVED.  XO,  THE  XO  DESIGN  LOGO,  XOPTIONS  AND  ALL  RELATED 
MARKS  ARE  TRADEMARKS  OF  XO.  ALL  OTHER  TRADEMARKS  ARE  PROPERTY  OF  THEIR  RESPECTIVE  OWNERS. 


Alert


•  Scan alert

>IR LosAngeles has 5462 sessions on proto tcp and flags 2 and if 50 in 5 minutes

50 = STRING: [not legible in the source]
50 = STRING: [not legible in the source]

>Snapshot picture
http://[...]LosAngeles-50-abnormal.png
>One week|month picture
http://[...]LosAngeles-50-abnormal-week.png
http://[...]LosAngeles-50-abnormal-month.png
>Top IPs in 10 minutes

Date first seen          Duration  Proto  IP Addr   Flows  Packets  Bytes  pps  bps
2007-12-05 08:51:02.520  289.718   any    218.233.  2114   2114     84560  7    2334
2007-12-05 08:51:20.493  130.413   any    218.234.  605    605      24200  4    1484

>Top IP info

IP        AS     AS name               FQDN
218.233.  [...]  [...] Telecom Inc.    [...]
218.234.  [...]  [...] Telecom Inc.    [...]
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xo 


Alert


•  Day

[Figure: RRD graph "IRLosAngeles ifindex 1 day sessions", showing failures, the expected value, and real sessions on ifindex 50 (current: 5279, average: 2402, max: 14633).]


Alert


•  Week

[Figure: RRD graph "IRLosAngeles ifindex one week sessions", showing failures, the expected value, and real sessions on ifindex 50 (current: 5683, average: 809, max: 7350).]
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Alert 


•  Scan  alert 

>Top IP detail
/ ip 218.233.[...]

**Traceroute (from hop 5 to 9)
[hop hostnames mostly obscured; legible fragments show ntt.net hops and round-trip times of 7.069 ms, 7.317 ms, 73.034 ms and 73.922 ms]

**Protocol summary for 218.233.[...]

Proto  Flows  Packets  Bytes  pps  bps   bpp
6      2116   2116     84640  7    2337  40
17     1      1        257    0    0     257

**Sampled netflow records
[ten TCP records, flags S, from 218.233.[...]:6000 to 65.99.[...]:7212]
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Alert 


•  Scan  alert 

/ ip 218.234.[...]

**Traceroute (from hop 5 to 9)
[hop hostnames illegible in source]

**Protocol summary for 218.234.[...]

Proto  Flows  Packets  Bytes
6      605    605      24200

**Sampled netflow records
[ten TCP records with source 218.234.[...]:6000]
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106. 

[Right-hand columns of the same alert: traceroute hops via ntt.net; protocol summary pps 4, bps 1484, bpp 40; each sampled record goes to 71.60.[...]:6588 with flags S and trailing fields 50 and 73]
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xo 


alert 


•  Storm  worm 


[Figure: IR Chicago icmp type|code sessions. Legend: Failures; Expected value; Real sessions on icmp type|code 2048. Current: 3333, Average: 3501, Max: 7413]
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Q  Alert  -  one  week  later 


•  DDos 

IR LosAngeles has 177002 sessions on proto tcp and flags 2 and if 50 in 5 minutes

50 = string:
50 = string:

>Snapshot picture
LosAngeles-50-abnormal.png
>One week | month picture
http://[...]LosAngeles-50-abnormal-week.png
http://[...]LosAngeles-50-abnormal-month.png

>Top IPs in 10 minutes

Date first seen          Duration  Proto  IP Addr        Flows   Packets  Bytes   pps  bps
2007-12-12 09:00:23.705  320.317   any    89.144.[...]   173554  176183   10.0 M  550  261507
2007-12-12 09:00:28.361  297.573   any    211.211.[...]  1048    1056     50688   3    136[?]
2007-12-12 09:00:43.293  282.273   any    211.206.[...]  692     708      33984   2    963
2007-12-12 09:00:43.269  291.093   any    211.44.[...]   658     667      42688   2    1173
2007-12-12 09:00:43.401  289.437   any    218.48.[...]   633     684      32832   2    907
2007-12-12 09:00:37.353  288.445   any    123.214.[...]  627     640      30720   2    852
2007-12-12 09:00:23.705  311.869   any    58.127.[...]   603     618      39552   1    1014

>Top IP info

[AS, AS name and FQDN columns for the seven IPs above are mostly obscured in the source; most of the legible AS names end in "Telecom Inc."]
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X 


Alert - one week later


Traceroute  returns  nothing 


>Top IP detail
/ ip 89.144.[...]

**Traceroute (from hop 5 to 9)    <- no traceroute info here
**Protocol summary for 89.144.[...]

Proto  Flows   Packets  Bytes   pps  bps     bpp
6      173974  176610   10.0 M  551  261950  59

**Sampled netflow records
[ten TCP records, flags S, 48 or 64 bytes each, from ten different sources (prefixes 219.25x, 123.212, 219.233, 58.124, 219.251, 58.123, 123.2x, 211.20x, 221.143, 218.23x; remaining octets obscured) to 89.144.[...]; trailing fields 50 and 10 or 11]
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X 


Alert - one week later


/ ip 211.211.[...]

**Traceroute (from hop 5 to 9)
[hop hostnames mostly obscured; legible round-trip times include 7.113 ms, 6.505 ms and 6.511 ms]

**Protocol summary for 211.211.[...]

Proto  Flows  Packets  Bytes  pps  bps   bpp
6      1057   1065     51120  3    1374  48

**Sampled netflow records
[ten TCP records, flags S, 48 bytes each, from 211.211.[...] (source ports 29937, 32301, 30596, 35573, 26497, 31263, 27378, 34829, 28267, 59695) to 89.144.[...]:80; trailing fields 0, 50 and 11]
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eet 


Alert  -  one  week  later 


[Figure: IR LosAngeles ifindex, one-day sessions, Tue 12:00 to Wed 08:00, y-axis ticks 20 to 200. Legend: Failures; Expected value; Real sessions on ifindex 50. Current: 161308, Average: 9734, Max: 161308]
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Alert  -  whitelist 


•  Special  customers 

>Top IP info

AS  IP  AS name  FQDN
[two entries, 64.39.[...] and 63.245.[...]; one AS name ends in "corporation" and one FQDN begins "core27", the rest is obscured]

>Top IP detail
/ ip 64.39.[...]

**Traceroute (from hop 5 to 9)
[hop hostnames illegible in source]

**Protocol summary for 64.39.[...]

Proto  Flows  Packets  Bytes  pps  bps  bpp
6      182    200      8584   0    231  42
17     1      1        58     0    0    58

**Sampled netflow records
[ten TCP records, flags S, from 64.39.[...] (source ports 2681, 37672, 38206, 35315, 39700, 40293, 40210, 40603, 41626, 41565) to 63.245.[...] on varying destination ports]
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eet 


Alert  -  whitelist  and  misc 


Whitelist  <cont> 

-  Email  servers 

- We don't want to miss a real attack even if an IP is on the
whitelist

Alert  email 

-  Suppression  period 

-  Subject 

•  12-05  abnormal  sessions  at  LosAngeles  proto  tcp  and 
flags  2  and  if  50 
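
One way to read the suppression period is as a per-alert rate limit: once an email has gone out for a given site and flow filter, repeats with the same key are held back until the period expires. The sketch below is only an illustration of that idea. The 30-minute period and the names `AlertSuppressor`, `should_send` and `subject` are assumptions, not part of the tooling described in the slides.

```python
from datetime import datetime, timedelta


class AlertSuppressor:
    """Rate-limit alert emails per key (hypothetical helper, not the original tool)."""

    def __init__(self, period=timedelta(minutes=30)):  # assumed period
        self.period = period
        self.last_sent = {}  # key -> time the last email was sent

    def should_send(self, key, now):
        """Return True and record the send time unless the key is suppressed."""
        last = self.last_sent.get(key)
        if last is not None and now - last < self.period:
            return False
        self.last_sent[key] = now
        return True


def subject(site, flowfilter, when):
    """Build a subject line in the style shown on the slide."""
    return f"{when:%m-%d} abnormal sessions at {site} {flowfilter}"


sup = AlertSuppressor()
key = ("LosAngeles", "proto tcp and flags 2 and if 50")
t0 = datetime(2007, 12, 5, 8, 55)
first = sup.should_send(key, t0)                          # first alert goes out
repeat = sup.should_send(key, t0 + timedelta(minutes=5))  # inside the period
later = sup.should_send(key, t0 + timedelta(minutes=45))  # period has expired
line = subject(*key, t0)
```

Keying on site plus filter means a new attack on a different interface still alerts immediately, which fits the stated goal of not missing real attacks.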


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Data  mining 


•  Database 

-  3  tables 

•  IP,FQDN,AS 

•  Summary 

•  Raw  netflow  data 

-  Data  mining 

• Which peering neighbor sends out the most attack traffic,
who is the most attacked, which port is most
often scanned, etc.
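
Questions of this kind reduce to GROUP BY queries over the raw flow table. The following is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name `raw_flows`, its columns, and the sample rows are all hypothetical, since the slides do not give the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the slides only say that a raw netflow table exists.
conn.execute(
    "CREATE TABLE raw_flows (src_ip TEXT, dst_ip TEXT, dst_port INTEGER, peer_as INTEGER)"
)
rows = [
    ("198.51.100.7", "192.0.2.1", 6000, 64500),
    ("198.51.100.7", "192.0.2.2", 6000, 64500),
    ("198.51.100.9", "192.0.2.1", 80, 64501),
    ("203.0.113.5", "192.0.2.3", 6000, 64500),
]
conn.executemany("INSERT INTO raw_flows VALUES (?, ?, ?, ?)", rows)


def top(column, limit=1):
    """Return the most frequent values of one column, with their flow counts."""
    # The column name is checked against a fixed set before being formatted in.
    if column not in {"dst_port", "dst_ip", "peer_as"}:
        raise ValueError(column)
    sql = (
        f"SELECT {column}, COUNT(*) AS n FROM raw_flows "
        f"GROUP BY {column} ORDER BY n DESC, {column} LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()


most_scanned_port = top("dst_port")
most_attacked = top("dst_ip")
noisiest_peer = top("peer_as")
```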
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Q  Data  mining 


•  Database 

-  3rd  party  outside  data 

•  Dshield  TOP  10000 

•  Dshield  AS 

•  CBL  data 

•  Mynetwatchman 

•  Our  own  darknet  project  output 

•  Other  private  outside  data 

- If an XO host is involved, we go through these tables
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problem 


•  Problem 

-  Peering  neighbor 

-  Alert  correlation 

•  But  you  can  do  it  in  database. 


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What  you  need 


Nfdump,  rrdtool,  mysql,  net-snmp,  apache,  some  unix 
commands 

A  box  with  linux  installed 
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For  more  info 
-  yiming.gong@xo.com 


Thanks! 


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EE/CS  Network  Infrastructure 


•  Three  buildings  with  one  router 

-  (Gates)  Computer  Science 

-  (Packard)  Electrical  Engineering 

-  (Allen)  Center  for  Integrated  Systems 

•  Composition 

-  25  VLANs  controlled  by  disparate  groups 

-  10,000  IP  addresses  (about  half  are  active) 

-  Eclectic  mix  of  Windows,  Linux,  Solaris,  OS-X,  ... 

-  No  firewall  beyond  minor  university  filters 

•  Analysts 

-  A  half-dozen  people  with  network  (and  other)  responsibilities 
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Incident  Investigation  Process 


•  Find  answers  to  a  set  of  classic  questions. . . 

-  Who 

-  What 

-  When 

-  Where 

-  Why 

-  How 

•  ...using  an  iterative  process 

-  Inspect  events  of  a  focus  node 

-  Augment,  refine,  filter  data 

-  Compare  events  of  related  nodes,  looking  for  correlation 

-  Pivot  on  an  “interesting”  node  to  refocus 
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Network  Data  Sources 

(each  step  is  orders  of  magnitude  more  volume) 


• Traffic counters (SNMP, MRTG, ...)

- Configurable in network devices

• Event/Alert logs (Syslog, httpd, snort, ...)

- Collected by firewalls, IDS, individual machines and services

• Flows (NetFlow, YAF, Argus, ...)

-  Typically  collected  at  border  routers  or  taps 

•  Packet  Headers  /  Traces  (tcpdump,  wireshark, ...) 

-  Collected  at  switches,  routers,  or  taps 
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Network  Flows 


•  Advantages 

-  Relatively  uniform  and  increasingly  available 

-  Hard  to  subvert 

-  Mitigate  privacy  concerns 

-  Largely  insensitive  to  encryption 

•  Disadvantages 

-  Still  voluminous  compared  to  event  logs 

-  Aggregate  measure 

-  Lack  content 
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Flow  Capture  and  Data  Management 


•  Sensor 

-  Span  ports  from  two  Cisco  backbone  switches 

-  See  all  layer  3  traffic  for  three  buildings  (not  just  external) 

-  Argus  capture  of  bidirectional  ICMP,  UDP,  TCP  flows 

•  Collector 

-  Raw  flows  from  sensor  are  multicast  locally  in  realtime 

-  Hourly  files  from  sensor  compressed  and  archived 

-  20-30M  (peak  70M)  Argus  flows/day  (~1G  compressed) 

-  Retain  several  months  of  data  online  for  analysts  to  access 
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Support  flat  files  and  database  tables 


•  Flat  text  files 

- Familiar format and familiar tools

-  Extracts  useful  for  exchange  and  reporting 

-  Straightforward  sequential  processing 

-  Import  to  other  tools  for  aggregation  and  analysis 

•  Relational  databases 

-  No  longer  exotic 

-  Suitable  for  large  data  volumes 

-  Greater  expressibility  for  queries 

-  Built-in  support  for  aggregation  and  analysis 
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Database  Infrastructure 


•  MySQL  server  running  on  collector 

-  Live  flows  from  sensor  inserted  in  real-time 

-  Daily  tables  recreated  from  archived  raw  flows 

-  Monthly  “merge”  tables 

-  Anonymize  extracts  for  research  with  CryptoPAN 

•  Flow  schema  tuning 

- Transform src/dst to local/remote

-  Add  ASN  (routeviews.org)  and  local  VLAN  metadata 

-  Convenience  columns  for  locality,  local  role,  dst  port 

-  Index  most  dimensions  (adds  about  50%) 

-  Tables  +  indices  ~2G/day 
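
The src/dst to local/remote transform can be sketched as follows. This is an illustration rather than the presenters' code: the `10.0.0.0/8` prefix stands in for the real site address space, and the output field names are assumptions loosely patterned on the column names that appear in the event table later in the deck.

```python
import ipaddress

# Assumed local address space; the real site prefixes are not in the slides.
LOCAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]


def is_local(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in LOCAL_NETS)


def to_local_remote(flow):
    """Rewrite a src/dst flow dict as local/remote; None if neither end is local."""
    src_local, dst_local = is_local(flow["src_ip"]), is_local(flow["dst_ip"])
    if src_local:
        loc, rem = "src", "dst"  # local host initiated (or local-to-local)
    elif dst_local:
        loc, rem = "dst", "src"  # remote host initiated
    else:
        return None
    return {
        "l_ip": flow[f"{loc}_ip"], "l_port": flow[f"{loc}_port"],
        "r_ip": flow[f"{rem}_ip"], "r_port": flow[f"{rem}_port"],
        "d_port": flow["dst_port"],  # service port stays tied to the destination
        "initiator": "local" if src_local else "remote",
        "local_only": src_local and dst_local,
    }


inbound = to_local_remote(
    {"src_ip": "203.0.113.9", "src_port": 51000, "dst_ip": "10.1.2.3", "dst_port": 22}
)
```

With every row keyed on the local host, "show me everything this machine did" becomes a single indexed lookup instead of a union over source and destination columns.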
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Flows  in  Incident  Handling 


•  Worms  and  Trolls 

-  Volume  and  promiscuity 

•  Immaculate  Intrusions 

-  Scrubbers,  Keyloggers,  and  Remote  Tunnels 

•  Botnets 

-  Beaconing  to  Command+Control  Hosts 
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Traffic  Volume 


•  Windows  Esbot  worm  circa  2005 

-  Spread  via  PNP  buffer  overflow 

-  Installed  backdoor  trojan 

-  Victim  turns  into  attacker 

•  Report 

-  Overall  traffic  suddenly  increased  an  order  of  magnitude 

•  Analysis 

-  Flow  distribution  showed  port  445  at  500-1000  flows/sec 

-  Keyed  on  445  traffic  to  identify  attackers 

-  Used  “flow  monitor”  to  reveal  local  compromises 
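
The analysis steps above amount to two counts: flows per destination port, then flows per source on the port that stands out. A minimal sketch, assuming flows are available as dicts with hypothetical `src_ip` and `d_port` keys (the sample data is made up):

```python
from collections import Counter


def port_distribution(flows, seconds):
    """Flows per second for each destination port over a window."""
    counts = Counter(f["d_port"] for f in flows)
    return {port: n / seconds for port, n in counts.items()}


def sources_on_port(flows, port, top=10):
    """Hosts generating the most flows to the given destination port."""
    counts = Counter(f["src_ip"] for f in flows if f["d_port"] == port)
    return counts.most_common(top)


# Made-up one-second sample: two hosts hammering port 445, one web flow.
flows = (
    [{"src_ip": "10.0.0.5", "d_port": 445}] * 6
    + [{"src_ip": "10.0.0.9", "d_port": 445}] * 3
    + [{"src_ip": "10.0.0.5", "d_port": 80}]
)
dist = port_distribution(flows, seconds=1)
suspects = sources_on_port(flows, 445)
```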


FloCon  2008 




Esbot  on  the  Flow  Monitor 


[Figure: Flow Monitor screenshot (MonitorMain) with In and Out panels. Filter: "[...]rt = 445 and is_l_only = 0 and r_asn=32". Plotted values: log(flows, packets, bytes). Last: Wed Aug 17 12:28:18 PDT 2005]
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Promiscuity 


•  SSH  Troll 

-  Intruder  gains  access  to  local  machine 

-  Installs  SSH  troll 

-  Launches  attack  on  remote  networks 

•  Report 

-  Odd  outbound  traffic  spike  from  local  IP 

•  Analysis 

-  Flow  distribution  showed  many  IPs,  few  ASNs,  single  port 

-  Backtrack  in  time  to  find  initial  SSH  compromise 

-  Pivot  reveals  other  victims 
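
The "many IPs, few ASNs, single port" signature can be expressed as three distinct-counts per local host. The sketch below is illustrative only: the field names follow the local/remote convention used elsewhere in the deck, and the thresholds in `looks_like_troll` are assumptions that a real deployment would need to tune.

```python
from collections import defaultdict


def fan_out(flows):
    """Per local host: distinct remote IPs, remote ASNs, and destination ports."""
    ips, asns, ports = defaultdict(set), defaultdict(set), defaultdict(set)
    for f in flows:
        ips[f["l_ip"]].add(f["r_ip"])
        asns[f["l_ip"]].add(f["r_asn"])
        ports[f["l_ip"]].add(f["d_port"])
    return {h: (len(ips[h]), len(asns[h]), len(ports[h])) for h in ips}


def looks_like_troll(profile, min_ips=100, max_asns=5):  # assumed thresholds
    n_ips, n_asns, n_ports = profile
    return n_ips >= min_ips and n_asns <= max_asns and n_ports == 1


# Made-up data: one host sweeping port 22 across two ASNs, one ordinary host.
flows = [
    {"l_ip": "10.0.0.5", "r_ip": f"198.51.{i // 100}.{i % 100}",
     "r_asn": 64500 + i % 2, "d_port": 22}
    for i in range(200)
] + [
    {"l_ip": "10.0.0.6", "r_ip": "203.0.113.1", "r_asn": 64510, "d_port": 80},
    {"l_ip": "10.0.0.6", "r_ip": "203.0.113.2", "r_asn": 64511, "d_port": 443},
    {"l_ip": "10.0.0.6", "r_ip": "203.0.113.3", "r_asn": 64512, "d_port": 25},
]
profiles = fan_out(flows)
flagged = sorted(h for h, p in profiles.items() if looks_like_troll(p))
```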
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SSH  Troll:  Volume  +  Promiscuity 
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SSH  Troll:  Identifying  targets 
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SSH  Troll:  Locate  Compromise 
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SSH  Troll:  Pivot  to  identify  other  victims 
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Immaculate  Intrusions  -  Keyloggers 


•  Unprotected  X-Window  server 

-  Intruder  maps  0x0  pixel  client  and  signs  up  for  keypress  events 

-  Steals  credentials  for  other  machines  from  local  user 

-  Uses  credentials  to  login  to  experimental  machine 

•  Report 

-  Experimental  machine  crashes  when  intruder’s  tools  fail 

•  Analysis 

-  Local  user  logged  in  when  user  not  present 

-  Discover  open  X-server  on  user’s  desktop  machine 

-  Backtrack  in  time  to  find  keylogger  flows 

-  Pivot  reveals  other  victims 
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Immaculate  Intrusions  -  Scrubbers 

•  Unpatched  Linux  machine 

-  Unpatched  server  vulnerable  to  remote  root  compromise 

-  Intruder  installs  backdoor,  trojan  binaries,  and  scrubs  logs 

-  Uses  trojan  ssh  to  steal  credentials  of  local  users 

-  Uses  ssh  known_hosts  data  to  attack  other  local  machines 

•  Report 

-  Local  machine  two  hops  away  found  sending  spam 

•  Analysis 

-  Backtrack  of  login  sessions  leads  to  compromised  machine 

-  Trojan  binaries  found,  but  no  plausible  root  logins 

-  Flow  logs  show  original  compromise  and  backdoor  logins 

-  Pivot  reveals  other  victims 
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Immaculate  Intrusions  -  Tunnels 


•  Tunnels 

-  Intruder  compromises  desktop  machine  running  VNC  client 

-  Desktop  machine  has  forwarded  ports  over  ssh-tunnel 

-  Intruder’s  traffic  is  tunnelled  and  reparented  inside  cluster 

•  Report 

-  Apparent  Nessus  scan  of  isolated  cluster  machine 

•  Analysis 

-  System  logs  of  head  node  show  no  logins 

-  Flow  logs  show  massive  ssh  traffic  from  compromised  machine 




Isis: Visual Analysis of Flow Data

(see paper by Phan et al. in VizSec 2007)


Progressive  Multiples 

•  Make  exploration  history  visible 

•  Reorder  rows  to  reveal  structure 
and  event  sequencing 
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Beaconing 


•  Botnet  zombie 

-  Intruder  gains  access  to  local  machine 

-  Installs  IRC  client  bot 

-  zombie  bot  “calls  home”  periodically 

•  Report 

-  Recurrent  traffic  to  suspect  IRC  servers 

•  Analysis 

-  Backtrack  in  time  to  find  initial  compromise 

-  Observe  tool  download  and  installation 

-  Pivot... 
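
Periodic "calling home" shows up as nearly constant gaps between flows to the same remote host. One simple way to score that, offered here as a sketch and not as the method used in the talk, is the coefficient of variation of the inter-arrival times. The `max_cv` threshold and the sample timestamps are assumptions.

```python
from statistics import mean, pstdev


def beacon_score(timestamps):
    """Coefficient of variation of inter-arrival gaps; near 0 means periodic."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # not enough evidence either way
    return pstdev(gaps) / mean(gaps)


def is_beacon(timestamps, max_cv=0.1):  # assumed threshold
    score = beacon_score(timestamps)
    return score is not None and score <= max_cv


# Flow start times in seconds: a bot checking in about every 1200 s with
# small jitter, and a person's irregular connections.
bot = [0, 1203, 2398, 3601, 4800, 6002]
person = [0, 30, 45, 900, 4000, 4010]
bot_flag = is_beacon(bot)
person_flag = is_beacon(person)
```

A bot that randomizes its check-in interval would evade this score, so it is best treated as one signal among several.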
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IRC  bot:  Timeline  Investigation 


[Figure: timeline for group 0-0, 75.64.71.22, 2006-03-21 01:00 to 2006-03-23 23:00, one bin every 1200 s. Aggregation: count(*) with max of 2019.0 and linear scaling. Froze D_PORT 6667]
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The  Event  Table 


[Figure: event table for 75.64.71.22 and not(l_role > 0 and d_port=80), 2006-08-21 10:00:01 to 2006-08-21 10:58:28]

Columns: l_weekday, l_hour, GMTfirst, duration, locality, l_role, proto, l_asn, l_vln, inet_ntoa(l_ipn), l_port, r_asn, r_vln, inet_ntoa(r_ipn), r_port, d_port, l_pkt, l_byte, l_abyte, r_pkt, r_byte, r_abyte

The visible rows run from 10:00:01 to 10:03:25 and all have local address 75.64.71.22 (l_asn 32, l_vln 71). They include repeated ICMP flows (proto 1) with 66.94.230.32, TCP port 22 flows with 75.66.189.156, UDP flows to port 53 on 75.64.24.201 and 75.64.24.227, UDP port 7001 flows, and TCP flows to port 25 with 75.67.9.109 and 77.232.79.23.
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From  Event  Table  to  Event  Plot 


Event  Table  Event  Plot 
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From  Event  Table  to  Event  Plot 


Event  Table  Event  Plot 
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Event  Plot 


[Figure: event plot for 75.64.71.22 and not(l_role>0 and d_port=80) and locality=1]
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IRC  Bot:  Initial  SSH  Connection 
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IRC  Traffic  on  port  6667 


[Figure: event plot for 75.64.71.22 and not(l_role>0 and d_port=80) and locality=1]
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Download  of  Intrusion  Tools 


[Figure: event plot for 75.64.71.22 and not(l_role>0 and d_port=80) and locality=1]
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Reordered  Rows 


[Figure: event plot for 75.64.71.22 and not(l_role>0 and d_port=80) and locality=1]
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Switch  to  Ordinal  Time 
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Mine  the  Gap 


[Figure: event plot for 75.64.71.22 and not(l_role>0 and d_port=80) and locality=1]
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Sequence  of  Intrusion 
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Future  Work 


•  Scalable  query  performance 

-  Want  to  query  billion  row  tables  at  interactive  speeds 

-  Column-oriented  database 

-  Distribute  across  commodity  cluster 

•  Finding  network  signatures 

-  Bottom  up  capture  of  analyst  domain  knowledge 

(see  our  paper  by  Xiao  in  VAST  2006) 

-  Top  down  search  for  frequent  patterns 

-  Build  disparate  flows  into  behaviors  (boot,  logon,  mail,  print, 
surf,  ...) 

•  Modeling  Local  Machine  Behavior 

-  Shift  the  burden  to  the  attacker? 
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The  Ripple  decoded 
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Very  large  scale  observation 

• Carrie Gates was interested in the degree of fan out
from outside to inside for her scan detection work.

• How many outside hosts use exactly one inside
host / service pair (unique destination address/port)?

• In the beginning, we did it the hard way, but Bloom
filters can be used to find unique sIP, dIP, dport
exemplar flows.

• If we make a source IP bag from the exemplar flows,
the counts will be the number of different host /
service pairs contacted by a given source host.

• Invert the bag to determine how many entries have a
count of 1, 2, 3, ... Plot hourly results for a week.
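
The exemplar, bag, and invert steps can be sketched in a few lines. This is a toy illustration, not the tooling used for the study: the `BloomFilter` class, its size and hash count, and the function name `contact_line` are all assumptions, and a production run over a week of flows would need a properly sized filter.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter; size and hash count are illustrative, not tuned."""

    def __init__(self, bits=1 << 20, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add_if_new(self, item):
        """Set the item's bits; return True if it was not already present."""
        new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.array[byte] & mask:
                new = True
                self.array[byte] |= mask
        return new


def contact_line(flows):
    """Map fan-out k to the number of sources that contacted exactly k dIP/dport pairs."""
    seen = BloomFilter()
    bag = Counter()  # source IP -> number of exemplar flows
    for sip, dip, dport in flows:
        if seen.add_if_new((sip, dip, dport)):  # first flow for this triple
            bag[sip] += 1
    return Counter(bag.values())  # the inverted bag


flows = [
    ("a", "x", 80), ("a", "x", 80), ("a", "y", 80),   # a contacts 2 pairs
    ("b", "x", 80),                                   # b contacts 1 pair
    ("c", "x", 22), ("c", "y", 22), ("c", "z", 22),   # c contacts 3 pairs
]
line = contact_line(flows)
```

A Bloom filter can report a new triple as already seen, so counts may be slightly low when the filter is heavily loaded; it never overcounts.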
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Outside  to  inside  -  July  2003 


[Figure: Number of unique source IPs that contacted X destination IPs per hour (routed, TCP only)]
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Developing  the  contact  surface 


• In the absence of the disturbance seen on the
previous page, contact lines seem to follow a power
law type of distribution

• or do they?¹

• We think this is really at least 3 separate
processes


• VLF noise

• “normal activity”

• Bulk scanning


[Figure: Number of sources that contacted X destinations per hour (incoming TCP routed), Monday, Sept. 15, 2003, log/log axes; x-axis: number of destinations. Fitted line: $e^{11.763367} \cdot x^{-1.957496}$]


¹ Everything is a straight line on log/log paper, especially if you use a fat marker.
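
A fit of the form y = e^a · x^b is a straight line in log/log space, so it can be recovered with ordinary least squares on (ln x, ln y). The sketch below shows that calculation on synthetic points generated from the coefficients quoted in the figure; it does not use the original measurements, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (ln x, ln y); returns (a, b) for y = e^a * x^b."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    a = my - b * mx
    return a, b


# Synthetic points generated from the fit quoted on the slide.
A, B = 11.763367, -1.957496
xs = [1, 2, 5, 10, 20, 50, 100]
ys = [math.exp(A) * x ** B for x in xs]
a, b = fit_power_law(xs, ys)
```

As the footnote warns, a good log/log fit alone does not establish a single power-law process; the residuals are where the ripple shows up.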
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Internet  wide  disturbance 

•  The  ripple  in  what  would  otherwise  be  a  fairly  straight 
log/log  plot  of  connectivity  was  observed  from  at  least 
Jan  -  Aug  2003. 

•  It  went  away  when  Blaster  appeared  in  Aug  2003. 

• A similar ripple existed from Feb 11 to May 31 2004,
coinciding with the lifetime of Welchia.B

•  In  this  case,  the  ripple  is  due  to  a  few  hundred 
machines  scanning  at  a  low,  fixed,  rate  induced  by 
a  loop  with  a  “sleep”  system  call. 

•  In  both  cases,  they  persisted  until  killed,  not  patched. 

•  We  have  been  told  that  the  ripple  is  back. 
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Details  of  the  Welch ia.B  event  -  onset 


[Figure: Number of sources that contacted X destinations (incoming TCP routed, per hour, averaged across each day)]
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Details  of  Welchia.B  -  demise 


[Figure: Number of sources that contacted X destinations per hour (AVG) (incoming TCP routed)]
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Design  Time  Coordination 

•  The  sleep  in  the  scan  loop  of  Welchia.B  points  to  a 
form  of  loose,  design  time,  coordination. 

•  All  members  of  the  cohort  scan  at  approximately  the 
same  rate,  using  the  same  random  generation 
scheme  but  with  a  different  random  seed. 

•  If  we  captured  all  the  scans  from  each  member  of  the 
cohort,  we  would  expect  to  see  a  small,  tight,  cluster 
of  scanners  all  contacting  nearly  the  same  number  of 
targets. 

•  We  observe  only  a  small  portion  of  the  address 
space  and  see  a  small  percentage  of  the  scans  from 
each  host  with  substantial  interhost  variation. 
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This  fall,  we  simulated  the  perturbations 

•  Generated  approximation  of  unperturbed  background 

•  Don't  care  about  process,  only  appearance

•  Simulated  perturbation  process  parameterized  on: 

•  Number  of  sources 

•  Probe  rate  /  source 

•  %  of  IPv4  monitored 

•  %  of  probes  intercepted 

•  For  ripple  or  wave,  %  monitored  =  %  intercepted 

•  For  scans  targeting  monitored  network  they  are  different 

•  Looked  at  observability  as  a  function  of  parameters. 
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Background  only  -  main  line  process 

Contact  Surface  for  24  hours,  4.0%  IPv4  monitored 
0  sources,  0  probes/hour,  4.0%  hit 
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Simulating  the  ripple 

• For each source, S_i, and each probe, j, emitted during
an observation period, we generate a random R_ij in
{0..1.0}.

• If R_ij is < the fraction of IPv4 monitored, it is a hit.

• Use the hit count to select the appropriate cell in the
background traffic contact line and add 1 to it.

• Source S_i hit that number of destinations during
the simulated observation period.

• Plot the modified contact line in either 2D or as part
of a 3D contact surface.
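
The procedure above translates almost directly into code. The following is a minimal sketch, not the simulator used for the talk: the background contact line is a made-up power-law curve, the function name `simulate_ripple` is hypothetical, and each hit is assumed to land on a distinct destination. The parameters mirror one of the plotted cases (1000 sources, 1800 probes per period, 4.69% of IPv4 monitored).

```python
import random
from collections import Counter


def simulate_ripple(background, sources, probes, monitored, seed=0):
    """Add simulated scanners to a contact line (fan-out -> number of sources)."""
    rng = random.Random(seed)
    line = Counter(background)
    for _ in range(sources):
        # Each probe lands in monitored space with probability `monitored`.
        hits = sum(1 for _ in range(probes) if rng.random() < monitored)
        if hits:
            line[hits] += 1  # this source contacted `hits` destinations
    return line


# Rough power-law background; the constants are made up for illustration.
background = {k: int(100000 * k ** -2) for k in range(1, 200)}
line = simulate_ripple(background, sources=1000, probes=1800, monitored=0.0469)
added = sum(line.values()) - sum(background.values())
bump = {k: line[k] - background.get(k, 0) for k in line}
peak = max(bump, key=bump.get)  # fan-out where the ripple is tallest
```

Each source's hit count is binomial, so with these parameters the bump centers near 1800 × 0.0469 ≈ 84 destinations and is several cells wide, which is the inter-host variation the earlier slide describes.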
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A  plausible  ripple 


Contact  Surface  for  24  hours,  4.690%  IPv4  monitored 
1000  sources,  1800  probes/hour,  4.690%  hit 


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fDALHOUSIE  Faculty  of  Computer  Science 

Privacy and Security Lab


Observability:  1000  probers  /16  coverage 


Contact  Surface  for  24  hours,  0.390%  IPv4  monitored 
1000  sources,  1800  probes/hour,  0.390%  hit 
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Observability:  100  probers  12  X  /8  cover 


Contact  Surface  for  24  hours,  4.690%  IPv4  monitored 
100  sources,  1800  probes/hour,  4.690%  hit 
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Simulated  and  real  spikes. 

•  The  spikes  appear  when  the  percentage  of 
intercepted  probes  is  high. 

•  Occurs  when  the  probes  fall  mostly,  95%+,  in  the 
monitored  address  space. 

•  At  100%,  the  spike  becomes  a  point 

•  First,  we  simulate  the  spike. 

• Next is a one month contact line for our /22, based on
Bloom filtering for unique sIP, dIP pairs.

•  Note  points  at  254,  508,  762  and  1016  addresses. 
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The  spike  in  the  Welchia.B  displays 


Contact  Surface  for  24  hours,  4.690%  IPv4  monitored 
20  sources,  720  probes/hour,  95.0%  hit 


[Figure axes: Contact Destinations; Outside Hosts]
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g  Minds 


Contact line for April 2006 for a /22

[Figure: Contact Surface: 2006/04/01 T00 for 1 month, Bloom filtered for unique sIP, dIP; log-scale x-axis: Inside Hosts, 1 to 1000]
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Future  work 

•  We  would  like  to  visit  or  revisit  the  data  for  current 
and  past  perturbations. 

• Develop analytical techniques for identifying cohorts
of players exhibiting arbitrary, but similar,
characteristics.

•  Explore  other  regions  of  the  contact  surface 

•  Link  visualization  to  source  /  cohort  identification  in 
the  visualization  tool  we  are  developing  for  DHS. 
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Visual  Representations 

of  Flow  Data 

and  the  Value  of  Visual  Language 


Presented  by  Sunny  Fugate 

Space  and  Naval  Warfare  Systems  Center,  San  Diego 


Human-Machine  Efficiency 


Over-Learned: Feedback


closed-loop : correct errors in production


open-loop : correct errors in semantics


2 basic forms of closed-loop feedback: haptic, visual / aural


Culture


[one diagram label illegible in source]
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Human-Machine  Efficiency 


Under-Learned:  Representation 


arbitrary  metaphor  association  representational  indexical 


Culture/Domain  Specificity 
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Human-Machine  Efficiency 

Over-Learned:  Feedback  -  haptic  vs  visual/aural 


Haptic  Feedback  Visual  /  Aural  Feedback 


Linear  access 


Language  Domains 


Mathematics 


American  English 
Grammar/Structure 


Cultures  and  knowledge  domains  don’t  necessarily 
use  the  same  lexicon  or  even  the  same  grammar! 


Well-formed 
American  English 


Non-grammatical 
American  English 


Medicine 


How  does  the  CND  lexicon  map  to  common  language? 
Technical  language?  Military/tactical  language? 
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Flow  in  hyperbolic  space 


©  2007  Sunny  Fugate 


♦ 3 month SSC project in 2002

♦ discover and apply network visualization tools
Hyperviewer: quasi-hierarchical hyperbolic space

'fish-eye' 3-d

Created by Stanford researcher Tamara Munzner


Flow  in  hyperbolic  space 


Easily  adapted  to  a  forced-hierarchy  view  of  flow 
Open-source C++ library and UI
Experimented  with  visual  methods 


colors 
graph  cycles 
scaling 
text  labels 


graph  size 

search  automation 
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Symmetry  in  port  access  from  3  separate  clients. 
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src/dst  ports  colored  red/blue 




Hierarchy  showing  client  subnet  and  server  ports 
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Shapes  Vector 


• Acquired by DARPA in 2002

• Developed by Australian DSTO

(Defence Science and Technology Organisation)

• JTF-GNO pilot program from 2003-2006

• What is it?

- Intelligent Agents gather information and produce inferences
Gathers information from multiple sources

pcap, flow, Snort, syslog, etc.

IAs perform automated data correlation & knowledge extraction
Integrates visual and command-line analysis

- Integrated visualization makes use of human vision
Supports visual analysis and decision-making
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Shapes  Vector 


Contextual 

Spatial 

Temporal 

Visual 


spatial,  temporal,  social,  topological 
physical  geography  or  metaphor 
sequences  in  time,  correlated 
use  visual  language  to  depict  objects  &  events 
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SVKA 


• Agents can be written in any language - must
conform to the SV ontology and knowledge
architecture (SVKA) specification

• Sensors can be built to wrap nearly any
information source - must produce SV
ontology

• SV ontology is a knowledge description
language for network defense
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Shapes  Vector -Visual  Language 

Easily  defined  visual  mappings 
No  applied  theory  of  visual  language 


shape/color/scale 


texture/icon 
connection  /  topology 


movement 


packet  events,  information  exchange,  attribute  changes,  attribute 
values,  host  id,  software,  processes,  machine  purpose,  network 
topology,  social  topology,  intrusion  events,  event  type,  event  priority, 
client  vs  server,  routing, ... 
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internet  v 


Automated  layout  to  arrange  hundreds  of  sub-graphs  in  a 
non-overlapping  manner. 
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Topological  layout  discovered  using  hints  in  the  data 
(e.g.TTL) 
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Color,  shape,  texture,  icon,  location,  arrangement 
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Shapes  Vector  Flow  Viewer 


JTF-GNO  funded  effort  to  implement  SV 
Use SV architecture and components
DARPA  demo  system  >  operational  system 
New  scripts,  sensors,  agents,  and  GUI 
Results 

A  visual  augmentation  of  CLI 
Produces  a  view  of  social  topology 
intuitive  view  of  gobs  of  data 
static topology and event replay
Links  statistical  views  and  topology  view 
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Flow  Viewer 

GUI 

♦  multiple  stats  views  linked  to  visuals 

♦ playback specific ranges & loop
adjust  replay  velocity 
time-skip 

IP  and  attribute  hotlists 
dynamic  filtering  controls 
managed  rwfilter 
filter  using  SV  ontology 

integration between flow, Trickler, IDS, & PCAP


Flow  Viewer 
Sensors 


SVKA 


IP  Networks 


consumes  rwf  &  rwcut  data 


Trickler Agent queries database for most recent attributes


PCAP Agent queries & reconstructs TCP sessions


IDS  Agents 


processes  IDS  logs 


Flow  Viewer 
Intelligent  Agents 


TricklerAgent 

FlowSensor  uses  correlations  from  FlowAgent 

Converts flow into ontology; query made on every unique [entry] seen

♦  produces  facts  ~  produces  visual  events 


Flow 

Sensor 


-  correlates  records 
o  counts  and  corroborates 
o  produces  inferences 

produces  visual  events 


Trickler


Flow  Viewer 
Visual  Language 


Leverage  cultural  knowledge 


Use  metaphors  for  abstract 


Color  by  ownership 

USA  AF  USN 

USMC  Joint  Govt 

Internet 
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Flow  Viewer 
Visual  Language 


[Figure: test installation. Labels: Area #1, CAT-5, KVM switch, video/key/mouse, Shapes Vector V440, analyst workstation, RGB video, display management system]


Flow  Viewer 


Visualization 

♦ Tested using:

♦ 100-5000 nodes
♦ 1M-3M flows
♦ 10K-300K flows per hour

Integrated filtering (rwfilter, SVKA filtering, visual)
Visual ID

Queries

Grouping (e.g. domain, netblock, vulnerability)

Replay mode or real-time
Historic visual context

Replay 'on top of' known incident


Flow  Viewer 
data  prep 


Include 

♦  Incoming  &  outgoing 
Hub  &  core-to-core  traffic 
Widest  possible  port  ranges 

Time-span  wider  than  the  activity  (minutes  to  hours) 
Suspect  IPs  and  ranges 

Filter 

Superfluous  port  traffic  (e.g.  80, 53, 25) 

IPs  that  are  unrelated  to  the  incident 
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Flow  Viewer  Performance 


[Figure: minimum frames per second (y-axis, 0 to 60) vs. # of visible objects (x-axis)]

Exceptional: 30-60 fps
Good: 20-30 fps
Acceptable: 10-20 fps
Unacceptable: < 10 fps

Graphics performance on dual 1.5GHz SPARC SunFire v440 with Sun XVR 1200
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Flow  Viewer  Performance 


Real-time Performance   Real-time Records / Hour   Optimal playback rate
Optimal                 10K-30K/hour               10X Real-time
Acceptable              40K-100K/hour              Real-time
Poor                    100K-300K/hour             1/10 X Real-time

Sparse  data  sets  can  be  viewed  quickly 
e.g.  months  of  data  in  minutes 


Dense  data  sets  can  be  viewed  slowly  or  filtered 
e.g.  seconds  of  data  in  minutes 
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Knowledge  Depth  vs  Breadth 


What  trade-offs  are  we  making? 

- UI Feedback?

Haptic  vs  visual  feedback 

-  Data  access? 

Random  access  vs  linear  access 

-  Training? 

Under-learned  vs  over-learned 
Tool  complexity 

-  Meaning? 

Visual  semantic  vs  text 
Intuitive/Iconic  vs  cryptic/coded 
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Jeff  Han's  Multi-Touch  Screen  Interface.  Jeff  Kubina,  Flickr.com,  license:  http://creativecommons.Org/licenses/by-sa/2.0/deed.en 
Atari  joystick,  duncan,  Flickr.com,  license:  http://creativecommons.Org/licenses/by-nc/2.0/deed.en 
Headphones,  daxtoor,  Flickr.com,  license:  http://creativecommons.Org/licenses/by-sa/2.0/deed.en 
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SPAWAR 

Systems  Center 
San  Diego 


Next  Generation  Tactical  Situation 
Assessment  Technology 

(NG-TSAT) 


Objective: Next-generation Tactical Chat. Icon-based situation
assessment (SA) language supported by wireless gesture-recognition
gloves used in hostile or noisy (silence-mandated)
environments

Description  of  Effort: 

1.  Linguistic  Analysis:  Analysis  of  current  C2  chat  logs  to 
determine  speech  patterns  and  repetitive  SA  concepts/themes 

2.  Iconic  Language  Development:  Output  of  linguistic  analysis 
determines  candidate  icons  representing  most  prevalent  SA 
“themes;”  development  of  prototype  C2  iconic  SA  language 

3.  Wireless,  Gesture-Recognition  Gloves:  Develop  wireless 
gloves  that  recognize  C2  icons/gestures  which  can  transmit 
across  network  to  distributed  warfighters  (replacing  keyboard 
input  when  in  MOPP) 

Benefits  of  TSAT: 

Compressed Chat (25% | content; 50% | reduction in production time) for rapid SA dissemination.
Gesture-recognition  in  very  noisy,  distributed  ops,  or  in  very  austere  environments  (e.g.,  the  moon) 

Challenges: 

1. No current method or theory for chat-meaning compression; currently done in prose; computer
linguistic analysis of unstructured text still neoteric.

2.  Wireless  gesture  recognition  glove  technology  still  in  infant  stages  of  development;  focused  on 
commercial  animation  support,  not  on  disciplined  language  support 

TRL:  Chat:  TRL  1-2;  Gesture-recognition:  TRL  1-4 


Major  Milestones  FY06: 

Linguistic  analysis  discovery  of  common  C2  SA  themes 
Development  of  icon/symbols  for  candidate  SA  themes 
Development  of  proof-of-concept  wireless  gesture-recognition  glove 

Period  of  Performance:  2007-2012 

PI  contact  info:  Dr.  LorRaine  Duffy,  (619)  553-9222, 
LorRaine.Duffy@navy.mil.  SSC  San  Diego,  CA 
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Sparklines 


Fingerspelling 
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Synaesthesia 


Synaesthesia:  "a  neurological  condition  in  which  two  or  more  senses 
are  coupled." 

"loud  color"  "sharp  laugh"  "bitter  wind" 


grapheme color synaesthesia - letters or numbers are perceived
as inherently colored


How  many  numbers  contain  the  digit  6? 


9910  9972  3292  7602  62  9054 
5636  2710  1944  6330  6560  8101 
5177  1955  7029  4083  4643  5710 
4935  2256  1495  1025  8375  8518 
80  797  2610  3008  8784  1854  2383 
9728  4523  573  5914  7975  281 
6664  2682  7689  7753  273  5597 
799  9960  1437  4534  8601  4563 
6734  647  9409  6543  4827  2398 
1532 
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Is  this  easier? 


[Figure: the number grid from the previous slide, repeated with digits highlighted in color]
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Emulating  Synaesthesia 


These methods can be used to achieve
sequence disambiguation and


9910  9972  3292  7602  62  9054 
5636  2710  1944  6330  6560  8101 
5177  1955  7029  4083  4643  5710 
4935  2256  1495  1025  8375  8518 
80  797  2610  3008  8784  1854  2383 
9728  4523  573  5914  7975  281 
6664  2682  7689  7753  273  5597 
799  9960  1437  4534  8601  4563 
6734  647  9409  6543  4827  2398 
1532 


[Figure: the same number grid, repeated with digits highlighted in color]
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Emulating  Synaesthesia 


192.168.1.232

129.168.1.233


[Figure: the same two addresses, repeated with digits highlighted in color]
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The  Ripple  decoded 


Carrie  Gates 

CA  Labs 

John  McHugh 

Canada  Research  Chair  in  Privacy  and  Security 

Dalhousie  University 
mchugh@cs.dal.ca
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Inspiring  Minds 


Very  large  scale  observation 

•  Carrie  Gates  was  interested  in  the  degree  of  fan  out 
from  outside  to  inside  for  her  scan  detection  work. 

• How many outside hosts use exactly one inside
host / service pair (unique destination address/port)?

• In the beginning, we did it the hard way, but Bloom
filters can be used to find unique sIP, dIP, dport
exemplar flows.

• If we make a source IP bag from the exemplar flows,
the counts will be the number of different host /
service pairs contacted by a given source host.

• Invert the bag to determine how many entries have a
count of 1, 2, 3, .... Plot hourly results for a week.
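
The bag-and-invert steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain `set` stands in for the Bloom filter (exact rather than probabilistic), flows are hypothetical `(sip, dip, dport)` tuples, and the function name `contact_line` is invented here rather than taken from any tool.

```python
from collections import Counter

def contact_line(flows):
    """Map k -> number of sources that contacted exactly k host/service pairs.

    flows: iterable of (sip, dip, dport) tuples (hypothetical record shape).
    """
    seen = set()            # stand-in for the Bloom filter (assumption)
    source_bag = Counter()  # source IP -> number of unique (dip, dport) pairs
    for sip, dip, dport in flows:
        key = (sip, dip, dport)
        if key in seen:     # not an exemplar flow: this triple was already counted
            continue
        seen.add(key)
        source_bag[sip] += 1
    # Invert the bag: how many sources have a count of 1, 2, 3, ...
    return dict(Counter(source_bag.values()))

flows = [
    ("10.0.0.1", "192.0.2.1", 80),
    ("10.0.0.1", "192.0.2.1", 80),   # repeat of the same triple, ignored
    ("10.0.0.1", "192.0.2.2", 80),
    ("10.0.0.2", "192.0.2.1", 25),
    ("10.0.0.3", "192.0.2.9", 443),
]
line = contact_line(flows)
```

Running this once per hour of flow data and plotting each result gives one contact line per hour.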
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Outside  to  inside  -  July  2003 


[Figure: Number of Unique Source IPs that Contacted X Destination IPs Per Hour (routed, TCP only). Y-axis: Unique Source IPs, log scale from 10 to 1e+07]
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Developing  the  contact  surface 


• In the absence of the disturbance seen on the
previous page, contact lines seem to follow a power
law type of distribution

• or do they? [1]

• We think this is really at least 3 separate
processes


•  VLF  noise 

•  “normal  activity” 

•  Bulk  scanning 


[Figure: Number of Sources that Contacted X Destinations Per Hour (incoming TCP routed), Monday, Sept. 15, 2003. Log/log plot; x-axis: LOG: Number of Destinations. Fitted line: e^11.763367 * x^-1.957496]


[1] Everything is a straight line on log/log paper, especially if you use a fat marker.
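
A fitted line of the form e^a * x^b is a straight line in log/log space, so it can be recovered with an ordinary least-squares fit on the logarithms. The sketch below illustrates that idea only; it is not the fitting procedure used for the slide, and the synthetic points are generated from the slide's coefficients rather than from real flow data.

```python
import math

def fit_power_law(points):
    """Least-squares fit of ln(y) = a + b * ln(x); returns (a, b), i.e. y ~ e^a * x^b."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic contact line built from the coefficients quoted on the slide.
A, B = 11.763367, -1.957496
synthetic = [(x, math.exp(A) * x ** B) for x in range(1, 201)]
a, b = fit_power_law(synthetic)
```

On real contact lines the fit is only as good as the power-law assumption, which the footnote above rightly questions.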
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Internet  wide  disturbance 

•  The  ripple  in  what  would  otherwise  be  a  fairly  straight 
log/log  plot  of  connectivity  was  observed  from  at  least 
Jan  -  Aug  2003. 

•  It  went  away  when  Blaster  appeared  in  Aug  2003. 

• A similar ripple existed from Feb 11 to May 31 2004,
coinciding with the lifetime of Welchia.B

•  In  this  case,  the  ripple  is  due  to  a  few  hundred 
machines  scanning  at  a  low,  fixed,  rate  induced  by 
a  loop  with  a  “sleep”  system  call. 

•  In  both  cases,  they  persisted  until  killed,  not  patched. 

•  We  have  been  told  that  the  ripple  is  back. 
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Details  of  the  Welchia.B  event  -  onset 


[Figure: Number of Sources that Contacted X Destinations (incoming TCP routed per hour, averaged across each day)]
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Details  of  Welchia.B  -  demise 


[Figure: Number of Sources that Contacted X Destinations Per Hour (AVG) (incoming TCP routed). X-axis: Number of Destinations, 20 to 200]
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Design  Time  Coordination 

•  The  sleep  in  the  scan  loop  of  Welchia.B  points  to  a 
form  of  loose,  design  time,  coordination. 

•  All  members  of  the  cohort  scan  at  approximately  the 
same  rate,  using  the  same  random  generation 
scheme  but  with  a  different  random  seed. 

•  If  we  captured  all  the  scans  from  each  member  of  the 
cohort,  we  would  expect  to  see  a  small,  tight,  cluster 
of  scanners  all  contacting  nearly  the  same  number  of 
targets. 

•  We  observe  only  a  small  portion  of  the  address 
space  and  see  a  small  percentage  of  the  scans  from 
each  host  with  substantial  interhost  variation. 
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This  fall,  we  simulated  the  perturbations 

•  Generated  approximation  of  unperturbed  background 

•  Don’t  care  about  process,  only  appearance 

•  Simulated  perturbation  process  parameterized  on: 

•  Number  of  sources 

•  Probe  rate  /  source 

•  %  of  IPv4  monitored 

•  %  of  probes  intercepted 

•  For  ripple  or  wave,  %  monitored  =  %  intercepted 

•  For  scans  targeting  monitored  network  they  are  different 

•  Looked  at  observability  as  a  function  of  parameters. 
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Background  only  -  main  line  process 

Contact  Surface  for  24  hours,  4.0%  IPv4  monitored 
0  sources,  0  probes/hour,  4.0%  hit 
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Simulating  the  ripple 

• For each source, S_i, for each probe, j, emitted during
an observation period,
we generate a random R_ij in {0..1.0}.

• If R_ij is < the % of IPv4 monitored, it is a hit.

• Use the hit count to select the appropriate cell in the
background traffic contact line and add 1 to it.

• Source S_i hit that number of destinations during
the simulated observation period.

•  Plot  the  modified  contact  line  in  either  2D  or  as  part 
of  a  3D  contact  surface. 
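
The simulation loop described above might look like the following Python sketch. It is an assumption-laden illustration rather than the authors' simulator: the function names are invented, the background is taken straight from the power-law form quoted on the slides, one observation period is treated as one hour, and the "Spread" parameter shown in the figure captions is not modelled.

```python
import random

def background_line(c, exponent, max_dest):
    """Unperturbed contact line: cell x holds about c / x**exponent sources."""
    return {x: c / x ** exponent for x in range(1, max_dest + 1)}

def add_ripple(line, n_sources, probes, monitored_fraction, rng):
    """Return a copy of the contact line with simulated probing sources added."""
    out = dict(line)
    for _ in range(n_sources):
        # Each probe is a hit if its random draw falls below the monitored fraction.
        hits = sum(1 for _ in range(probes) if rng.random() < monitored_fraction)
        if hits > 0:
            # This source contacted `hits` monitored destinations in the period.
            out[hits] = out.get(hits, 0.0) + 1
    return out

rng = random.Random(123456)
base = background_line(150618.0, 1.9575, 1000)
perturbed = add_ripple(base, n_sources=1000, probes=1800,
                       monitored_fraction=0.0469, rng=rng)
# Cells changed by the ripple, and by how many sources each one grew.
added = {x: perturbed[x] - base.get(x, 0.0)
         for x in perturbed if perturbed[x] != base.get(x, 0.0)}
```

With these parameters each source is expected to land roughly 1800 * 0.0469 ≈ 84 hits, so the added sources pile up in a narrow band of cells around that value, which is the bump seen as the ripple.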
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A  plausible  ripple 


[Figure: Contact Surface for 24 hours, 4.690% IPv4 monitored;
1000 sources, 1800 probes/hour, 4.690% hit.
Axes: Contact Sources (log scale, 1 to 1e+06), Time in Hours, Contact Destinations (up to 1000).
Background: Y = 150618 / X^1.9575; Spread = 40.0%, Seed = 123456]
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Observability:  1000  probers  /16  coverage 


[Figure: Contact Surface for 24 hours, 0.390% IPv4 monitored;
1000 sources, 1800 probes/hour, 0.390% hit.
Background: Y = 12524.8 / X^1.9575; Spread = 40.0%, Seed = 123456]
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
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Observability: 100 probers, 12 x /8 coverage


Contact  Surface  for  24  hours,  4.690%  IPv4  monitored 
100  sources,  1800  probes/hour,  4.690%  hit 
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Simulated  and  real  spikes. 

•  The  spikes  appear  when  the  percentage  of 
intercepted  probes  is  high. 

•  Occurs  when  the  probes  fall  mostly,  95%+,  in  the 
monitored  address  space. 

•  At  100%,  the  spike  becomes  a  point 

•  First,  we  simulate  the  spike. 

• Next is a one month contact line for our /22, based on
Bloom filtering for unique sIP, dIP pairs.

• Note points at 254, 508, 762 and 1016 addresses.

• Then we will look at a movie for 14 months on the /22
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The  spike  in  the  Welchia.B  displays 


Contact  Surface  for  24  hours,  4.690%  IPv4  monitored 
20  sources,  720  probes/hour,  95.0%  hit 


Contact  Destinations 
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Outside  Hosts 
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Contact line for April 2006 for a /22

Contact Surface: 2006/04/01 T00 for 1 month.
Bloom filtered for unique sIP, dIP


1  10  100  1000 
Inside  Hosts 
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Future  work 

•  We  would  like  to  visit  or  revisit  the  data  for  current 
and  past  perturbations. 

•  Develop  analytical  techniques  for  identifying  cohorts 
of  players  exhibiting  arbitrary,  but  similar 
characteristics. 

•  Explore  other  regions  of  the  contact  surface 

•  Link  visualization  to  source  /  cohort  identification  in 
the  visualization  tool  we  are  developing  for  DHS. 

•  and  always  remember ... 
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Greetings  from  Canada 


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Abstract  of  this  presentation 

Ideas for increasing (optimizing) the performance of processes in
IPFIX


Ideas based on all processes using an order rule of
Information Elements/fields


These ideas are introduced:

□ Method for reducing the number of comparisons between an existing
flow and an incoming new packet in Metering Processes (MPs)
(Comparison method for multiple fields in MPs)

□ Method for reducing the number of copies of flow records from
Metering Processes to Exporting Processes (EPs) with a predefined
order of fields
(Copy method for multiple fields in EPs)

□ Method for increasing processing speed for storing data in incoming
packets to file with a predefined format in Collecting Processes (CPs)
(Copy method for multiple fields in CPs)


These are basically the same.
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Motivation  of  this  research 


■  Background 

□  Network  bandwidth  will  continue  to  increase. 

□  IPFIX  will  be  a  standard  protocol  for  flow 
information  exchange. 

■ Network bandwidth will become broader-band.

□ Use a lower sampling rate.

□ Use fewer Flow Keys.

However, with either measure, flow information will become less accurate.


Research on increasing (optimizing) the
performance of IPFIX processes
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IPFIX  features 

IPFIX 

Advantage:  Uses  Template-based  flexible  flow  export 
Disadvantage:  More  complex  than  fixed-format  protocol 

Comparison of processes between flexible and fixed
formats:

[Figure: Observed packets -> Metering Process (cache) -> Exporting Process -> Data Records -> Collecting Process -> Storage]


NetFlow v5 (fixed format)

□ MP reorders fields of observed packet to Flow Record
arranged in NFv5 format.

□ EP inserts NFv5 header and sends it.

□ CP can send Flow Records with normalized format to
storage by removing NFv5 header.


IPFIX (flexible format)

□ MP reorders fields of observed packet to Flow Record of
internal format in cache. The format depends on the
implementation of each Exporter.

□ EP reorders fields based on configured Templates and
sends it.

□ If CP sends Flow Records with normalized format to
storage, it must reorder fields of incoming packets.
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Our  approach:  Making  the  order  rule  for  Information  Elements 

■  Processes  of  IPFIX  have  a  high  possibility  of 
reordering  fields. 

□ Reducing the cost of reordering fields can improve their
performance.

■ Our approach

□  Make  the  order  rule  for  Information  Elements 

■  Order  rule  gives  IPFIX  processes  chances  to  process 
multiple  fields. 

■  Processing  multiple  fields  at  a  time  achieves  higher 
performance  than  processing  one  field  at  a  time. 

■  The  rule  does  not  influence  the  flexibility  of  IPFIX. 


If a unified order rule of fields/IEs is defined,
reordering costs can be reduced.
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Idea  of  order 

■  Idea  of  order: 

□ MPs, EPs and CPs place fields (IEs) in the same order, so it
is highly likely that multiple fields will be processed at a time.

■ This reduces reordering costs.

■ Order recommended in this presentation

□ Place fields in observed packets in order of protocol header.

□ Therefore, order of IEs that refer to packets and
header fields is recommended.


Process                Input                                    Output
Metering Processes     Observed packets (network byte order)    (Storing) their caches
Exporting Processes    Their caches                             IPFIX Data Record (network byte order)
Collecting Processes   IPFIX Data Record (network byte order)   (Storing) files, their DB (real-time analysis)

NTT Network Service ... Laboratories, NTT Corporation


Example of using same order in MP, EP and CP

Flow Keys: sourceIPv4Address, destinationIPv4Address, sourceTransportPort, destinationTransportPort


Good (ideal) case: Same suggested order, which refers to the order of packet header fields, used in the cache in Exporter and IPFIX data records

[Figure: IPv4/UDP header fields (v, IHL, TOS, Total Length, ID, F, Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address, Source Port, Destination Port, UDP Length, UDP Checksum) mapped to a cache and a Data Record that are both laid out as sourceIPv4Address, destinationIPv4Address, sourceTransportPort, dstTransportPort]


Bad case: Different order used in the cache in Exporter and IPFIX data records

[Figure: the same header fields mapped to a cache and a Data Record whose fields are placed in different orders]


■  If  the  referential  order,  which  refers  to  the  order  of  packet  fields,  is 
defined,  it  could,  in  some  cases,  lead  to  increased  performance. 

■  If  a  referential  order  is  undefined,  there  is  no  possibility  of  increased 
performance. 
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1st  idea  to  improve  performance 
in  environment  in  which  MP,  EP,  and  CP  use  the  same  order 


Comparison  method  for  multiple  fields  in 

Metering  Processes  (MPs) 
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Comparison  method  for  multiple  fields  in  MP  (1) 


■ MP must repeat comparison between existing Flow Records in
its cache and new observed packet.

□ To judge whether the new packet belongs to a new flow or an existing one.

■ Basically, in this comparison, all fields (IEs) serving as Flow
Keys are compared every time.

■ If fields of Flow Records are placed in the same
order as packet header fields, MP can compare
multiple fields at a time.

[Figure: Metering Process. An observed packet is compared against each Flow Record in the cache; MP repeats comparisons and finds a flow.]
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Comparison  method  for  multiple  fields  in  MP  (2) 

Example: Flow Key: Version, IHL, TOS, source Address, destination Address

All fields are compared every time (general approach)

[Figure: an observed packet's IPv4 header (v, IHL, TOS, Total Length, ID, F, Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address) compared field by field with a Flow Record in cache]

When a packet arrives:
5 comparisons

1. IP version

2. IHL

3. TOS

4. Source Address

5. Destination Address


Multiple field comparison (our approach)

Premise: Fields of Flow Records are placed in the same (referential) order as packet header fields

Mask created when template is defined (one row per 32-bit header word):

ff ff 0000     (v + IHL, TOS, Total Length)
0000 0000      (ID, F + Offset)
00 00 0000     (TTL, Protocol, IP Checksum)
ffffffff       (Source IPv4 Address)
ffffffff       (Destination IPv4 Address)

[Figure: the observed packet is masked so that non-key fields become 0; the masked observed packet is then compared with a Flow Record in cache]

When Template is defined:
Create a Mask

When a packet arrives:
Mask the packet
And
compare these memory
areas at the same time
(e.g., memcmp in C language)

Or

1. v + IHL + TOS

2. Source Address

3. Destination Address
(32-bit architecture)
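
The mask-then-compare idea can be illustrated with a short Python sketch. It is a hedged illustration, not the presenters' implementation: the helper names are invented, the header is assumed to be a 20-byte IPv4 header without options, and a single `bytes` equality test plays the role that `memcmp` plays in C.

```python
HEADER_LEN = 20  # IPv4 header without options (assumption)

def build_mask(key_ranges, length=HEADER_LEN):
    """Build the mask once, when the Template is defined: 0xff over Flow Key bytes."""
    mask = bytearray(length)
    for start, end in key_ranges:
        for i in range(start, end):
            mask[i] = 0xFF
    return bytes(mask)

def apply_mask(header, mask):
    """Zero every byte of the header that is not part of the Flow Key."""
    n = len(mask)
    masked = int.from_bytes(header[:n], "big") & int.from_bytes(mask, "big")
    return masked.to_bytes(n, "big")

# Flow Key from the slide: Version + IHL + TOS (bytes 0-1),
# source address (bytes 12-15), destination address (bytes 16-19).
MASK = build_mask([(0, 2), (12, 16), (16, 20)])

def same_flow(header, cached_key, mask=MASK):
    """One comparison over the whole masked area instead of one per field."""
    return apply_mask(header, mask) == cached_key

header_a = bytes([0x45, 0x00, 0x00, 0x54, 0x12, 0x34, 0x40, 0x00,
                  64, 17, 0xAB, 0xCD, 192, 0, 2, 1, 198, 51, 100, 7])
# Same flow key, but different length, ID, TTL and checksum.
header_b = bytes([0x45, 0x00, 0x01, 0x00, 0x99, 0x99, 0x00, 0x00,
                  63, 17, 0x11, 0x22, 192, 0, 2, 1, 198, 51, 100, 7])
# Different source address, so a different flow.
header_c = bytes([0x45, 0x00, 0x00, 0x54, 0x12, 0x34, 0x40, 0x00,
                  64, 17, 0xAB, 0xCD, 192, 0, 2, 9, 198, 51, 100, 7])

cached_key = apply_mask(header_a, MASK)  # what the cache would store for this flow
```

In this sketch the cached Flow Record is stored already masked, so only the incoming packet needs masking on arrival.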
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Comparison  method  for  multiple  fields  in  MP  (3) 

■ Number of operations in this method

□ Mask costs are smaller than comparison costs.

□ Therefore, this method is effective at increasing performance by reducing
the number of comparisons, although it increases mask operations.


Operation       Number of operations
Mask creation   Once in an IPFIX session (when Template is defined)
Mask            Depends on the number of observed packets (when packet arrives)
Comparison      Depends on the number of observed packets and number of flow records in cache

(less <-> more, from mask creation to comparison)


■ Effective and ineffective cases

Effective case:
Flow Keys are placed densely.

Ineffective case:
Flow Keys are placed sparsely.

[Figure: two copies of the IPv4 header layout (v, IHL, TOS, Total Length, ID, F, Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address), one with densely placed Flow Keys and one with sparsely placed Flow Keys]
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2nd  idea  to  improve  performance 
in  environment  in  which  MP,  EP,  and  CP  use  the  same  order 



Copy  method  for  multiple  fields  in 

Exporting  Processes  (EPs) 
and  Collecting  Processes  (CPs) 
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Overview  of  copy  method  for  multiple  fields 
It is a very simple method.

□ If fields in the format of cache and IEs in exporting Data
Records are placed in the same order, EPs have a chance
to copy multiple adjacent constant-fixed-length IEs at a time.

□ If IEs in received Data Records and fields in Collectors'
internal format to store Flow Records are placed in the
same order, CPs have a chance to copy multiple adjacent
constant-fixed-length IEs at a time too.


IE size classification of IPFIX (terminology in this presentation)

Protocol specification   In this presentation
Fixed-length IE          Constant-fixed-length IE (e.g., IP Address)
                         Reduced-size-encoding applicable IE (e.g., counters)
Variable-length IE       Variable-length IE (octet array, strings)
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Example  of  copy  method  for  multiple  fields  in  EP 


■  Conditions  for  copying  multiple  fields 

Flow  Record  in  cache  and  Exporting  Data  Record  must  use  the  same 
order. 


IEs  must  have  a  constant  fixed  length. 

■  Almost  all  IE  characterizing  properties  of  flow  are  constant  fixed  length. 

□  Byte-orders  must  be  the  same. 

■  Observed  packet  and  Exporting  Data  Records  use  network  byte  order. 

□  IEs  for  copying  multiple  fields  must  be  adjacent. 


[Figure: mapping from the Flow Record in cache to the Exporting Data Record.

Flow Record in cache:
- Characteristic properties of flow: V, IHL, TOS, Total Length, ID, Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address, Source Port, Destination Port, UDP Length, UDP Checksum
- Measured properties of flow: Flow Start Absolute Time, Flow End Absolute Time, Flow Octet Count

Exporting Data Record: ipClassOfService(5), protocolIdentifier(4), sourceIPv4Address(8), destinationIPv4Address(12), srcv4PrefLen, dstv4PrefLen, srcTransportPort, dstTransportPort, flowStartSysUpTime(22), flowEndSysUpTime(21), octetDeltaCount(1), packetDeltaCount(2)

Arrows: "Copy one field at a time" (adjust length, convert byte order, calculate time, etc.) and "Copy multiple fields at a time".

Legend: Constant-fixed-length IE; Reduced-size-encoding-applicable IE.]
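
The copy method above can be illustrated with a minimal sketch. It assumes a hypothetical cache layout in which six constant-fixed-length fields are stored adjacently, in export order, and in network byte order; the names `CACHE_FORMAT`, `FIELD_SIZES`, `export_field_by_field`, and `export_block_copy` are illustrative and do not come from the presentation.

```python
import struct

# Hypothetical cache layout (an assumption for illustration): fields are kept in
# packet-header order and network byte order, so they match the export order.
CACHE_FORMAT = "!BB4s4sHH"        # TOS, protocol, src IPv4, dst IPv4, src port, dst port
FIELD_SIZES = [1, 1, 4, 4, 2, 2]  # constant fixed lengths in bytes


def export_field_by_field(cache_record: bytes) -> bytes:
    """Copy one IE at a time: one slice and one append per field."""
    out = bytearray()
    offset = 0
    for size in FIELD_SIZES:
        out += cache_record[offset:offset + size]
        offset += size
    return bytes(out)


def export_block_copy(cache_record: bytes) -> bytes:
    """Copy all adjacent constant-fixed-length IEs with a single slice."""
    return bytes(cache_record[:sum(FIELD_SIZES)])


record = struct.pack(CACHE_FORMAT, 0, 17, bytes([192, 0, 2, 1]),
                     bytes([198, 51, 100, 2]), 5353, 53)
```

Both functions produce the same bytes; the block copy only works because all four conditions listed above hold for this layout.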
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Evaluation  &  Conclusion 


NTT Network Service Systems Laboratories, NTT Corporation


This material contains an evaluation of the comparison method only.

For an evaluation of the copy method, please see the material I presented at a past IETF meeting: http://www3.ietf.org/proceedings/07jul/slides/ipfix-10.pdf.




Evaluation  of  comparison  method  for  multiple  fields 


[Figure: bar chart of processing time, single-field vs. multiple-field comparison, for three Flow Key sets (P = Protocol, SA/DA = Source/Destination IPv4 Address, SP/DP = Source/Destination Port):
- P+SA+DA+SP+DP: almost the same
- TTL+P+SA+DA+SP+DP: 27% faster
- V+IHL+TOS+TTL+P+SA+DA+SP+DP: faster (value illegible)
Under each bar group, an IPv4 and transport header layout (V, IHL, TOS, Total Length, ID, F, Offset, TTL, Protocol, IP Checksum, Source IPv4 Address, Destination IPv4 Address, Source Port, Dst Port) marks the Flow Keys and the comparison method.]

■  When  the  density  of  Flow  Key  fields  is  higher, 
this  method  works  faster. 
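
One way to see why a denser Flow Key helps is to merge adjacent key fields into contiguous spans and compare each span once. The sketch below is an illustration only: the offsets assume an IPv4 header without options (IHL = 5) followed directly by the transport ports, and `KEY_FIELDS`, `merge_adjacent`, and `same_flow` are hypothetical names.

```python
# (offset, length) of Flow Key fields inside the packet, assuming IHL = 5:
# Protocol, Source Address, Destination Address, Source Port, Destination Port.
KEY_FIELDS = [(9, 1), (12, 4), (16, 4), (20, 2), (22, 2)]


def merge_adjacent(fields):
    """Merge fields that touch each other into single (offset, length) spans."""
    merged = []
    for offset, length in sorted(fields):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((offset, length))
    return merged


def same_flow(packet: bytes, cached: bytes, spans) -> bool:
    """Compare the packet with a cached flow key, one comparison per span."""
    return all(packet[o:o + n] == cached[o:o + n] for o, n in spans)


packet = bytes(range(24))
spans = merge_adjacent(KEY_FIELDS)   # five comparisons shrink to two
```

Adding TTL (offset 8) to the key makes it adjacent to Protocol, so the number of spans stays at two even though the key has grown by one field.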




Computing environment for the evaluation

■ Software Exporter program

□ runs on Intel Xeon 3.06 GHz HT architecture

□ runs on Linux (Debian GNU/Linux 4.0)

□ compiled by gcc4

■ optimization option: -O3

■ Data used as observed packets:

□ PCAP data published by WIDE project.

□ contains 6,906,333 packets.

□ ftp://mawi.nezu.wide.ad.jp/pub/mawi/samplepoint-B/20060303/200603030100.dump.gz


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Conclusion

■ Introduced ideas to improve the performance of IPFIX processes

□ Comparison method for multiple fields in MPs

□ Copy method for multiple fields in EPs and CPs

■ These ideas are based on defining the order rule of IEs/fields

□ Our recommendation: IEs/fields are placed in the order of the packet header fields.

■ The order rule is published as an individual Internet Draft

□ http://tools.ietf.org/id/draft-irino-ipfix-ie-order-03.txt

□ If you agree with these ideas, please work with us.


High  Level  Flow  Correlation 


Valentino  Crespi,  California  State  Los  Angeles,  CA 
Annarita Giani, UC Berkeley, CA
Rajiv  Raghunarayan,  Cisco  Systems,  Inc. 


FloCon  2008,  Savannah  GA,  January  7-10,  2008. 


Outline 


1. Extension of previous work on Flow Aggregation (FloCon 2006).

2. Embedding of network traffic in a Euclidean Space.

3.  Complex  modeling. 

4.  Planned  work. 
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Behind  Flow  Aggregation 


[Figure: data-reduction pyramid, from bottom to top: BYTES (millions per hour), PACKETS (hundreds of thousands per hour), FLOWS (thousands per hour), Flow Aggregates. Annotation: "How data move".]


•  Monitoring 

•  Anomaly  detection 

•  Security  analysis 

•  Traffic  profiling 

•  Debugging 

•  Traffic  engineering 

•  Usage-based  profiling 

•  Network  planning 

•  Pricing,  peering 


Data  Reduction  =  Fewer  events  to  be  analyzed 
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Process  based  analysis  of  flows 


We  believe  that  automated  correlation  at 
the  raw  flow  level  is  complicated  and 
susceptible  to  false  positives.  The  world 
consists  of  processes  so  our  approach  to 
correlation is process-based.

Implementation  of  a  PQS  based  process 
detection  for  Cyber  Situational  Awareness. 


Process  Query  System 


[Figure: Process Query System. Observable events coming from sensors are matched against Models M1, M2, ..., Mk; each resulting hypothesis carries a likelihood L1, L2, ..., Lk. Tracking follows MHT (Reid 1979).]

Giani, De Souza, Berk, Cybenko, "Attribution and Aggregation of Network Flows for Security Analysis," in Proc. FloCon 2006, Portland, OR.


Flow  +  Snort  Alerts 


Scenario:  several  packets  in  a  flow  triggered  IDS  alerts 


Snort  rule  1560 
generates  an  alert 
when  an  attempt 
is  made  to  exploit  a 
known  vulnerability 
in  a  web  server  or  a 
web  application. 


Snort rule 1352
generates an alert
when an attempt is
made to access
the robots.txt file
directly.
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Track 2: A simple track of correlated IDS and Flow events


SNORT 

ALERTS 


FLOW 


The  flow  can  be  characterized  as  malicious  and  further  investigation  must  be  done. 


Flow  aggregation  and  correlations 
between  flow  data  with  security 
events 


The idea is to formulate hypotheses by associating new observations with an existing pool of rated hypotheses in all possible ways, and then to calculate the new ratings recursively.
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Correlation  Engine  for  Computer  Security 
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Models  Sensors 


Sensors and Models

Sensors:

1. DIB:s - Dartmouth ICMP-T3 Bcc: System
2. Snort, Dragon - Signature Matching IDS
3. IPtables - Linux Netfilter firewall, log based
4. Samba - SMB server, file access reporting
5. Flow sensor - Network analysis
6. ClamAV - Virus scanner
7. Tripwire - Host filesystem integrity checker

Models:

• Noisy Internet Worm Propagation - fast scanning
• Email Virus Propagation - hosts aggressively send emails
• Low&Slow Stealthy Scans - of our entire network
• Unauthorized Insider Document Access - insider information theft
• Multistage Attack - several penetrations, inside our network
• DATA movement
• TIER 2 models



Example - Phishing Attack Model

[Figure: state diagram of the phishing attack model, with states labeled RECON, ATTEMPT, and UPLOAD.]
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Current  aggregators  and  analyzers 


•  POWERFUL TOOLS to understand the behavior of the network according to certain parameters, e.g. the amount of resources consumed or the variance of the various characteristics of the communication (source IP, destination IP, port).

•  PROBLEM:  They  do  not  provide  an  analysis  and  a  description  of  the 
dynamic  evolution  of  network  traffic. 

•  NEED  for  a  structure  that  summarizes  the  behavior  of  the  network. 

OUR  IDEA 

Combine  flow  aggregation  techniques  with  our  previous  process-based 
approach: 

Use  aggregators  and  flow  analyzers  to  translate  traffic  into  a  process  to 
be  modeled  and  estimated. 


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Build  circuits  of  Aggregating  gates 

1. Place observing nodes in multiple locations of the network (e.g. on each local router).

2. Each observing node dumps traffic flows to a Macro Aggregator.

3. Macro Aggregator: a circuit in which each gate is a flow aggregator.

■ The first layer consists of classical aggregators that output flow aggregates. Successive layers process aggregates of flow aggregates.

■ Final output: a vector function of the dumped traffic ranging in $\mathbb{R}^n$:

$X(t) = (x_1(t), x_2(t), \dots, x_n(t))$

At each time the observing nodes produce a set of vectors:

$S(t) = \{X_1(t), X_2(t), \dots, X_n(t)\}$

4. Identify and analyze properties of $S(t)$ over time to characterize/detect anomalies.
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Embed  Traffic  in  Euclidean  Space 


[Figure: at each network device, incoming packets are grouped into flows. Aggregators AG-SIP (source IP), AG-DIP (destination IP), AG-Prot (protocol), AG-H (entropy), and AG-Alerts (alerts) feed AG-Final, and the AG-Final outputs of all devices go to the Macro Aggregator (MA).]

$X_i(t) = (x_{i,1}(t), x_{i,2}(t), \dots, x_{i,n}(t))$, $X_j(t) = (x_{j,1}(t), x_{j,2}(t), \dots, x_{j,n}(t))$

Example components: (Entropy S-IP, Entropy D-IP, Average Size, ..., %TCP Traffic, %UDP Traffic)

$S(t) = \{X_i(t), \dots, X_j(t), \dots\}$
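
A minimal sketch of how one observing node might compute such a vector from its flows. Everything here is an assumption for illustration: the flow records are plain dictionaries with hypothetical keys `src`, `dst`, `proto`, and `bytes`, the entropy uses log base 2, the traffic percentages are computed over bytes, and the input is assumed to be non-empty.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def feature_vector(flows):
    """Return (entropy S-IP, entropy D-IP, average size, %TCP, %UDP) for one node."""
    total_bytes = sum(f["bytes"] for f in flows)
    tcp = sum(f["bytes"] for f in flows if f["proto"] == 6)
    udp = sum(f["bytes"] for f in flows if f["proto"] == 17)
    return (
        entropy(f["src"] for f in flows),
        entropy(f["dst"] for f in flows),
        total_bytes / len(flows),
        100.0 * tcp / total_bytes,
        100.0 * udp / total_bytes,
    )


flows = [
    {"src": "10.0.0.1", "dst": "10.9.9.9", "proto": 6, "bytes": 100},
    {"src": "10.0.0.2", "dst": "10.9.9.9", "proto": 6, "bytes": 100},
    {"src": "10.0.0.3", "dst": "10.9.9.9", "proto": 17, "bytes": 100},
    {"src": "10.0.0.4", "dst": "10.9.9.9", "proto": 1, "bytes": 100},
]
x = feature_vector(flows)
```

Collecting one such tuple per observing node at time t gives the set S(t) described above.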
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Aggregating  flows  (1) 

Flows of security attacks usually have common patterns and form conspicuous traffic clusters. The method identifies clusters of attack flows in real time and aggregates the large number of short attack flows into a few metaflows. Purpose is mostly security.

• Same sourceIP ~ worm propagation

• Same destIP ~ Denial of Service attack

• Same destIP and sourceIP ~ most portscans

Example:

$H(X) = -\sum_i P[X = ip_i] \log P[X = ip_i]$

$APP = \min_{X \in \{dstIP, srcPort\}} \{H(X)\}$

$x_i = APP$

Entropy  Based  Flow  Aggregation  (2006) 

Yan  Hu,  Dah-Ming  Chiu,  and  John  C.S.  Lui 
The  Chinese  University  of  Hong  Kong 



Keys: srcIP, dstIP, srcPort (Proto), dstPort (Proto)

Properties of clusters containing attack traffic:

•  Fixed value in one of the keys.

•  Number of flows in cluster large.

•  Size of flows in cluster small.

[Figure: a cluster key tuple with the labels srcIP, dstPort (Proto), "Fixed Keys", and "Random Dimensions" (* *). Aggregation Priority Parameter: X = dstIP.]

Aggregating  flows  (2) 


A  small  percentage  of  flows  consume  most  of  the  network  bandwidth. 


Study of heavy flows in 4 orthogonal dimensions, examining their correlations:

•  Size

•  Duration

•  Rate

•  Burstiness: the flow is divided into bins $b_i$ of duration T; burstiness is the standard deviation of $b_i$.

The flow can be:

•  Elephant - large size flows

•  Tortoise - high duration flows

•  Cheetah - large rate flows

•  Porcupine - high burstiness

Traffic:

% of Elephant = E
% of Tortoise = T
% of Cheetah = C
% of Porcupine = P

$x_i = (E, T, C, P)$
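
A minimal sketch of turning per-flow bins into the vector (E, T, C, P). The thresholds, the use of the population standard deviation, and the choice to count percentages over flows (rather than over bytes) are assumptions made for illustration; the slide does not specify them. Each flow is given as a list of per-bin byte counts.

```python
import statistics


def flow_features(bins, bin_seconds):
    """Return (size, duration, rate, burstiness) for one flow given its bins b_i."""
    size = sum(bins)
    duration = len(bins) * bin_seconds
    rate = size / duration
    burstiness = statistics.pstdev(bins)  # standard deviation of the bins
    return size, duration, rate, burstiness


def traffic_vector(flows, bin_seconds, thresholds):
    """Percentage of flows that are Elephant, Tortoise, Cheetah, Porcupine."""
    counts = [0, 0, 0, 0]
    for bins in flows:
        features = flow_features(bins, bin_seconds)
        for i, (value, limit) in enumerate(zip(features, thresholds)):
            if value > limit:
                counts[i] += 1
    return tuple(100.0 * c / len(flows) for c in counts)


flows = [[1000, 1000, 1000, 1000], [10], [0, 400, 0, 0]]
# Hypothetical thresholds for size, duration, rate, burstiness.
E, T, C, P = traffic_vector(flows, bin_seconds=1, thresholds=(500, 3, 300, 100))
```

A single flow can fall into several classes at once, so the four percentages do not need to sum to 100.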


On  the  correlation  of  Internet  flow  characteristics  (2003) 

Kun-Chan  Lan,  John  Heidemann 

Information  Science  Institute,  University  of  Southern  California 
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Analyze  S(t)  over  time 

Our  Interest:  Study  the  evolution  of  correlations  of  flow  aggregate 
attributes  over  time  in  order  to  detect/estimate  anomalies  and  attacks. 

$S(t) = \{X_1(t), X_2(t), \dots, X_n(t)\}$

•  Build  models  (continuous,  discrete  event,  probabilistic,  etc.)  to  describe 
the  observation  process  S(t). 

•  Track  the  unknown  states  of  the  built  model  given  S(t). 

•  Apply  Learning  Techniques  to  learn  models. 

•  Use  Tracking  Machines  to  estimate  the  hidden  state  sequence  given  the 
observables  and  the  models. 


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Metric  of  traffic  features 


$S(t) = \{X_1(t), X_2(t), \dots, X_n(t)\}$

Use clustering techniques (e.g., spectral clustering, k-means based algorithms, etc.) to cluster the observing nodes and detect "shifts" of traffic characteristics between different observation nodes:

1. Study how clusters change over time and characterize/detect anomalies.

2. Use clusters to produce a graphic representation of the traffic.

3. Define discrete models to describe the evolution of clusters in relation to specific events: coordinated computer attacks, presence of covert channels, bugs in the network software, hardware breakdowns, etc.
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Spectral  Clustering 


Input: similarity matrix $M = [a_{ij}]$, number $k > 0$

$a_{ij} = s(X_i, X_j)$, e.g. $a_{ij} = \exp\left(-\frac{\|X_i - X_j\|^2}{2\sigma^2}\right)$

•  Build the similarity graph, for example the graph whose adjacency matrix is $A_G = M$.

•  $L = \mathrm{Laplacian}(A_G)$

•  Compute the k eigenvectors of L associated with the k smallest eigenvalues: $v_1, v_2, \dots, v_k$

•  $V = [v_1\ v_2\ \dots\ v_k]$, an $n \times k$ matrix

•  Pick the rows of V: $y_1, y_2, \dots, y_n$

•  Cluster the $y_i$'s using the k-means algorithm into $C_1, C_2, \dots, C_k$

Output: clusters $C_1, C_2, \dots, C_k$
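
The steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions: it uses the unnormalized Laplacian L = D - A (the slide does not say which Laplacian is meant), a Gaussian similarity with a hand-picked sigma, and a small deterministic k-means with farthest-point initialization; `spectral_clustering` is a hypothetical helper name.

```python
import numpy as np


def spectral_clustering(X, k, sigma=1.0, iters=50):
    """Cluster the rows of X into k groups; returns one label per row."""
    X = np.asarray(X, dtype=float)
    # Similarity matrix a_ij = exp(-||X_i - X_j||^2 / (2 sigma^2)).
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Unnormalized graph Laplacian L = D - A (an assumption).
    L = np.diag(A.sum(axis=1)) - A
    # eigh returns eigenvalues in ascending order, so take the first k vectors.
    _, vectors = np.linalg.eigh(L)
    V = vectors[:, :k]
    # k-means on the rows y_i of V, with deterministic farthest-point seeding.
    centers = [V[0]]
    for _ in range(1, k):
        dist = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels


points = [[0, 0], [0.1, 0], [0, 0.1], [3, 3], [3.1, 3], [3, 3.1]]
labels = list(spectral_clustering(points, k=2))
```

In the setting of these slides, each row of `X` would be the feature vector of one observing node at a given time.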
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Discrete  Models  of  Cluster  Evolution 


[Figure: clusterings of the observing nodes at time $T_k$ and at time $T_{k+1}$, with transitions labeled "No attacks" and "Attack".]

PROBLEM: characterize anomalous unobservable events associated with transitions between clusters.

Idea: Build Deterministic Finite State Automata models to identify transitions. In this case we identify anomalies by studying the current clustering in relation to the previous "snapshot" of traffic.

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Challenges 


Parameter  estimation:  in  our  example  of  clustering  k 
was  fixed. 

Define  and  learn  models  of  the  system’s  dynamics. 

Identify  relevant  attributes  of  flow  aggregators  to 
obtain  significant  vectors. 

Define  appropriate  similarity  function. 

Use  a  realistic  Data  Set  to  verify  approaches. 
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Planned  Work 


Implement  clustering  method. 

Develop  discrete  models. 

Build a software monitor to analyze traffic through clusters and vector representation.

Experimental analysis of the effectiveness of our approach.
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Complex  Phishing  Attack  Observables 
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Identifying  Anomalous  Traffic 
Using  Delta  Traffic 


Tsuyoshi KONDOH and Keisuke ISHIBASHI
Information  Sharing  Platform  Labs. 
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Outline 

•  Background  and  Motivation 

-  Identifying  anomalous  traffic  is  the  missing  piece. 

•  Our  Technique:  DELTAA 

-  Concepts 

1. Extract anomalous traffic as the delta of normal and anomalous time periods.

2. Auto-aggregate extracted anomalous traffic.

-  Operation  of  our  technique 

•  Show  the  step  by  step  operation  of  our  technique. 

•  Evaluation 

-  Evaluation  using  synthesized  DDoS  traffic. 

•  Summary 
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Background and Motivation

•  Monitoring of traffic volumes is widely used for network operation (e.g. MRTG).

•  Many techniques for detecting anomalous volume change have been proposed (NBAD, Holt-Winters in MRTG, etc.).

•  Some tools exist to mitigate damage from anomalous traffic (e.g. drop/rate limit at router, detour to Cisco Guard, etc.).

However, accurate mitigation needs accurate ACLs (an ACL set).

But generating an accurate ACL set requires manual drill down by an operator.

-  It's too costly.

[Figure: time series of total traffic (bps), where the anomaly is auto-detected, followed by manual drill down of the anomalous traffic through time series of protocol composition and of dst port composition (e.g. 443 (https), dns, other).]


Our  Technique:  DELTAA 

•  DELTAA  outputs  ACL  set  for  filtering  or  rate  limiting  to 
mitigate  the  damage  from  anomalous  traffic. 

-  DELTAA:  Delta  Traffic  Automatic  Aggregator 

•  Three  concepts  of  DELTAA: 

1. Reveal anomalous traffic using delta traffic between normal and anomalous periods.

2. Aggregate delta traffic and generate an optimized ACL set on single dimensions.

-  Dimension means source IP address, destination IP address, protocol or port numbers.

3. Generate a multi-dimensional ACL set by integrating each single-dimensional ACL set.
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Concept  #1 : 

(1)  Definition  of  “Normal”  and  “Anomalous”  Traffic 

Throughout  this  presentation,  I  use  the  following  definitions. 

1. Anomalous traffic: Traffic that causes a change in traffic volume (bps/pps/fps).

-  BitTorrent and server intrusion are out of scope because they always exist or do not cause a change in traffic volume.

2. Normal period: Period when traffic volume is normal.

3. Anomalous period: Period when traffic volume is anomalous.

[Figure: traffic volume (bps) over time; normal traffic throughout, with anomalous traffic added on top during the anomalous period.]
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Concept  #1  : 

(2)  Reveal  Anomalous  Traffic 

•  Make  two  assumptions 

1. traffic of normal period = normal traffic

2. traffic of anomalous period = normal traffic + anomalous traffic

•  anomalous traffic = traffic of anomalous period - traffic of normal period

[Figure: bps over time; the traffic of the normal period is subtracted from the traffic of the anomalous period, leaving the anomalous traffic.]

Extracting anomalous traffic from the "traffic of anomalous period" alone is difficult, because it is a mixture of normal and anomalous traffic.

By taking the delta between the "traffic of normal period" and that of the anomalous period, we can effectively extract the anomalous traffic.


Concept  #2: 

Auto-aggregate Delta Traffic

•  Our  technique  expresses  anomalous  traffic  with  some 
number  of  ranges. 

-  For  example  source  IP  address  ranges. 

-  The  ranges  should  be  optimal  for  filtering. 


Example 1: ACL set for covering all anomalous traffic. ACL(1) will filter out normal traffic, as a false positive.

Example 2: Better range selection, splitting the ACL range to avoid collateral damage (ACL(1), ACL(2), ...). Why is this selection better?

[Figure: for each example, normal and anomalous traffic plotted over src_ip from 0.0.0.0 to 255.255.255.255, with the ACL ranges marked.]
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Criteria  of  “Goodness” 


•  We  introduce  three  criteria  of  identification. 


1.  Coverage  ratio:  ( 1  -  FNR  ) 

Maximize  filtered  anomalous  traffic 


2.  Collateral  (damage)  ratio:  (  FPR  ) 

Minimize  filtered  (normal)  legitimate  traffic 

3.  Number  of  ACLs: 

ACL  entry  budget  is  limited,  so  fewer  ACLs  is  better. 


These  three  criteria  have  a  trade-off  relationship  with 
each  other. 


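How to decide the best ACL set?

[Figure: bps of normal and anomalous traffic with candidate ACL ranges. Two ranges are marked "Good!"; a third is marked "... probably Bad": it is uncertain whether to use it or not, because of collateral damage.]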
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Evaluation  Formula  for  Goodness 


To decide the best ACL set, we introduce this formula (coverage: cov, collateral ratio: coll, no. of ACLs: n):

$\mathrm{rate} = \frac{(\beta - \alpha) + \alpha \cdot \mathrm{cov} - \beta \cdot \mathrm{coll}}{n^{\gamma}}$

($\alpha, \beta, \gamma$: weighting coefficients)

-  By tuning the weighting coefficients, we can reflect network policies or customer requirements.

Using alpha=1, beta=2, gamma=0.1 for this example:

Example 1: coverage=100%, collateral=30%, no. of ACLs=2: rate=1.31

Example 2: coverage=95%, collateral=0%, no. of ACLs=3: rate=1.57
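
A small sketch of the goodness score. The slide's formula is badly garbled in this copy, so the form used here, ((beta - alpha) + alpha*cov - beta*coll) / n^gamma, is an assumption; it was chosen because it reproduces the rate of 1.31 given for Example 1. The function name `acl_rate` is illustrative, and cov and coll are passed as fractions between 0 and 1.

```python
def acl_rate(cov, coll, n, alpha, beta, gamma):
    """Goodness of an ACL set under the assumed form of the slide's formula.

    cov   : coverage ratio (fraction of anomalous traffic filtered)
    coll  : collateral ratio (fraction of normal traffic filtered)
    n     : number of ACLs in the set
    """
    return ((beta - alpha) + alpha * cov - beta * coll) / (n ** gamma)


# Example 1 from the slide: coverage 100%, collateral 30%, 2 ACLs.
example_1 = acl_rate(1.0, 0.30, 2, alpha=1, beta=2, gamma=0.1)
```

Raising beta penalizes collateral damage more strongly, and raising gamma penalizes long ACL lists; this matches the intent of the weighting coefficients even if the exact published form differs.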


Step  by  Step  Explanation  of  Our  Technique 


•  The following seven pages show the step-by-step operation of our technique, including the two concepts above and concept #3.
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Step 1: (1) Counting Up

Count traffic of both normal and anomalous periods for each source IP address.

[Figure: traffic volume (bps) per source IP address (src_ip) for the normal period (600 Mbps) and the anomalous period (1 Gbps).]

Step  1:  (2)  Making  Delta  Traffic 

Make  delta  traffic  by  subtracting  traffic  of  normal  period  from  that  of 
anomalous  period. 


DELTAA obtains anomalous traffic with granularity of source IP addresses as delta traffic.

[Figure: for each source IP address (src_ip from 0.0.0.0 to 255.255.255.255), the traffic of the normal period (600 Mbps) is subtracted from that of the anomalous period (1 Gbps), leaving the anomalous traffic (400 Mbps).]
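
Step 1 can be sketched as a per-key subtraction. This is an illustration, not the DELTAA implementation: the per-source volumes are made-up numbers scaled to match the 600 / 1000 / 400 totals on the slide, and clamping negative deltas to zero is an assumption.

```python
def delta_traffic(normal, anomalous):
    """Per-source delta: anomalous-period volume minus normal-period volume.

    Sources whose volume did not increase are dropped (an assumption).
    """
    delta = {}
    for key in set(normal) | set(anomalous):
        diff = anomalous.get(key, 0) - normal.get(key, 0)
        if diff > 0:
            delta[key] = diff
    return delta


# Hypothetical per-source volumes in Mbps.
normal = {"10.0.0.1": 300, "10.0.0.2": 300}
anomalous = {"10.0.0.1": 300, "10.0.0.2": 350, "119.170.1.1": 350}
delta = delta_traffic(normal, anomalous)
```

The same subtraction applies to any single dimension (destination IP, protocol, ports) by changing what is used as the dictionary key.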


Step  2:  (1)  Deciding  ACL  Set  as  IP  Address  Ranges 


•  When using anomalous traffic information only, collateral damage cannot be avoided.

-  It causes mis-filtering of normal traffic.

•  So, we need to use information on both normal and anomalous traffic.

[Figure: with anomalous traffic only, the normal traffic is "don't care" and the chosen ACL causes collateral damage to normal traffic; with both normal and anomalous traffic, the range is split into ACL(1) and ACL(2).]
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Step  2:  (2)  Building  Tree  of  Normal  and  Anomalous 
Traffic 

Making  Traffic  Tree 

•  Build  up  from  individual  source  IP  addresses  (depth=32). 

•  Each  node  has  information  about  coverage  and  collateral  ratio. 

-  Collateral ratio: normal traffic of the node / total normal traffic

-  Coverage ratio: anomalous traffic of the node / total anomalous traffic

•  Make parent nodes by merging child node information.

[Figure: traffic tree built from depth=32 (individual source IP addresses, with normal and anomalous traffic) up through depth=4, depth=3, depth=2, depth=1 to depth=0.]
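
A minimal sketch of the traffic tree for the source IP dimension. Instead of merging child nodes bottom-up, it adds each address's volume to every prefix on its path, which yields the same per-node totals; the input dictionaries and the function name `build_traffic_tree` are assumptions for illustration.

```python
import ipaddress
from collections import defaultdict


def build_traffic_tree(normal, anomalous, max_depth=32):
    """Map each prefix 'a.b.c.d/len' to (coverage ratio, collateral ratio)."""
    total_normal = sum(normal.values()) or 1
    total_anomalous = sum(anomalous.values()) or 1
    # (prefix as int, depth) -> [normal volume, anomalous volume]
    tree = defaultdict(lambda: [0.0, 0.0])
    for index, table in enumerate((normal, anomalous)):
        for ip, volume in table.items():
            addr = int(ipaddress.IPv4Address(ip))
            for depth in range(max_depth + 1):
                prefix = (addr >> (32 - depth)) << (32 - depth)
                tree[(prefix, depth)][index] += volume
    return {
        f"{ipaddress.IPv4Address(prefix)}/{depth}":
            (anom / total_anomalous, norm / total_normal)
        for (prefix, depth), (norm, anom) in tree.items()
    }


normal = {"10.0.0.1": 60, "200.0.0.1": 40}
anomalous = {"200.0.0.2": 100}
tree = build_traffic_tree(normal, anomalous)
```

Here `tree["192.0.0.0/3"]` covers all anomalous traffic at the cost of 40% of the normal traffic, while `tree["200.0.0.2/32"]` covers it with no collateral.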


Step  3:  Selecting  Best  Node  Set  from  Traffic  Tree 

1. To reduce the search space, delete unnecessary nodes.

-  Unnecessary node: a node having a small coverage ratio (little anomalous traffic) or little difference from its descendant nodes.

2. Search for the best node combination by applying the formula to all non-overlapping node combinations in a brute force way.

-  Best node combination = Best ACL set for source IP dimension


[Figure: traffic tree over the source IP dimension, from depth=32 (distinct src_ip, normal and anomalous traffic) through depth=4 (divided into 16), depth=3, depth=2, and depth=1 (divided into 2) to depth=0 (not divided). Each node is labeled with its prefix and its coverage/collateral values, e.g. 0.0.0.0/3 (12.5/5), 32.0.0.0/3 (12.5/20), 64.0.0.0/3 (12.5/0), 96.0.0.0/3 (12.5/5), 0.0.0.0/2 (25/25), 64.0.0.0/2 (25/5), 0.0.0.0/1 (50/30), 0.0.0.0/0 (100/100). Unnecessary nodes having little coverage ratio are deleted. Selected nodes: ACL(1)=0.0.0.0/3, ACL(2)=64.0.0.0/2, ACL(3)=192.0.0.0/3. An annotation gives rate=0.29 for Example 2; the rest of the annotation is illegible.]
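
The brute-force search of Step 3 can be sketched as follows. This is an illustration under assumptions: the node values are hypothetical fractions, the rate function uses the assumed form ((beta - alpha) + alpha*cov - beta*coll) / n^gamma, and `max_acls` is an illustrative cap on the combination size.

```python
import ipaddress
from itertools import combinations


def rate(cov, coll, n, alpha, beta, gamma):
    """Assumed form of the goodness formula (see the hedge above)."""
    return ((beta - alpha) + alpha * cov - beta * coll) / (n ** gamma)


def best_acl_set(nodes, alpha=1.0, beta=2.0, gamma=0.1, max_acls=3):
    """Try every non-overlapping combination of nodes and keep the best rate.

    nodes: prefix string -> (coverage ratio, collateral ratio), as fractions.
    """
    nets = {p: ipaddress.ip_network(p) for p in nodes}
    best, best_rate = (), float("-inf")
    for size in range(1, max_acls + 1):
        for combo in combinations(sorted(nodes), size):
            if any(nets[a].overlaps(nets[b]) for a, b in combinations(combo, 2)):
                continue  # overlapping prefixes would double count traffic
            cov = sum(nodes[p][0] for p in combo)
            coll = sum(nodes[p][1] for p in combo)
            score = rate(cov, coll, size, alpha, beta, gamma)
            if score > best_rate:
                best, best_rate = combo, score
    return best, best_rate


# Hypothetical pruned tree.
nodes = {
    "0.0.0.0/0": (1.0, 1.0),
    "0.0.0.0/1": (0.1, 0.9),
    "128.0.0.0/1": (0.9, 0.1),
    "192.0.0.0/2": (0.9, 0.0),
}
best, score = best_acl_set(nodes)
```

The search is exponential in the number of nodes, which is why pruning unnecessary nodes first matters.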


Concept  #3  Generate  Multi-Dimensional  ACL  Set 


•  Generate single-dimensional ACL sets in parallel.

-  'source IP', 'destination IP', 'protocol', 'source port' and 'destination port'

•  Make candidates of multi-dimensional ACL sets as product sets of each dimension.

•  Count anomalous/normal traffic for every candidate.

•  Select the best combination of candidates in terms of goodness score.


[Figure: candidates of the multi-dimensional ACL set. The single-dimensional ACLs of the src_ip dimension (ACL_srcip(1), ACL_srcip(2), over 0.0.0.0 to 255.255.255.255) and of a second dimension, each shown with its projected bps, are combined into candidates ACL(1) to ACL(4). Shading indicates traffic volume (bps) from little to high.]

Counting anomalous and normal traffic per candidate.

Finally, DELTAA outputs the best multi-dimensional ACL set.

Evaluation  and  Results:  Test  Data  Set 


•  Normal traffic: publicly available traffic data, captured on a transpacific line (100 Mbps)

•  Anomalous traffic: injected synthesized DDoS attack traffic

-  Mimics a large DDoS attack

•  We chose source/destination IP addresses that have large normal traffic, because simple identification would cause collateral damage.

•  Destination: a popular server that appeared in the normal traffic

•  Source: an IP address block (/16) from which the volume of normal traffic to the destination is largest.

•  Test how well our technique can extract the injected anomalous traffic.

Use the "weighting coefficients"

-alpha=1 (weight for coverage)

-beta=10,000 (weight for collateral)

-gamma=0.0001 (weight for no. of ACLs)

to avoid collateral damage.


Evaluation  and  Results:  Results  (1) 


•  Results:  We  get  four  ACLs  (Four  ACLs  are  one  set.) 

-  coverage:  93.75% 

-  collateral:  0.00% 


Time  series  of  traffic  with  output  ACLs  displayed  in  separate  colors 
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Evaluation  and  Results:  Results  (2)  OUTPUT 


Basic information:

basetime_len= 60.0 (sec): (1168362060.0 - 1168362120.0)
anomtime_len= 60.0 (sec): (1168362180.0 - 1168362240.0)
base_total_bps= 89,121,539.5
anom_total_bps= 137,729,812.7
diff_total_bps= 48,608,273.2 (+54.5 %)

Single dimensional identification results:

1-D OUTPUT: PROTOCOL= 6
    coverage= 100.42 collateral= 95.52
1-D OUTPUT: SRC PORT= high
    coverage= 108.27 collateral= 33.42
1-D OUTPUT: DST PORT= high
    coverage= 100.09 collateral= 96.40
1-D OUTPUT: SRC_IP
    coverage= 96.43 collateral= 0.00
    119.170.0.0/17      coverage= 51.43 collateral= 0.00
    119.170.128.0/18    coverage= 25.72 collateral= 0.00
    119.170.192.0/19    coverage= 12.86 collateral= 0.00
    119.170.240.0/20    coverage= 6.43 collateral= 0.00
1-D OUTPUT: DST_IP
    coverage= 102.93 collateral= 2.17
    134.45.182.70/32    coverage= 102.93 collateral= 2.17

MULTI-DIMENSION_FLOW_OUTPUT coverage= 96.43 collateral= 0.00

(columns: coverage, collateral, src_ip, dst_ip, protocol, src_port, dst_port)

flowID_0: cov= 51.43 col= 0.00: 119.170.0.0/17 134.45.182.70/32 6 high high
flowID_1: cov= 25.72 col= 0.00: 119.170.128.0/18 134.45.182.70/32 6 high high
flowID_2: cov= 12.86 col= 0.00: 119.170.192.0/19 134.45.182.70/32 6 high high
flowID_3: cov= 6.43 col= 0.00: 119.170.240.0/20 134.45.182.70/32 6 high high


Evaluation  and  Results  (3):  Destination  IP  Tree 
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Evaluation  and  Results  (4):  Source  IP  Tree 
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Summary 

•  Introduced three criteria for an optimal ACL set.

-  for mitigating DDoS attacks on routers

•  Proposed the DELTAA technique: it optimizes the trade-off among these criteria, using normal and anomalous traffic.

•  Presented an example of applying DELTAA to extract injected anomalous traffic.

-  Evaluation results using a prototype and a synthesized data set.


Thank  you. 

Any  questions  are  welcome. 


tsuyoshi.kondoh  [at]  lab.ntt.co.jp 
ishibashi.keisuke  [at]  lab.ntt.co.jp 

This  study  was  supported  by 
the  Ministry  of  Internal  Affairs  and  Communications  of  Japan. 


Q: Will the Calculation Complexity Explode?

•  The way of making and compressing the single-dimensional tree is similar to Estan's way in [Automatically].

•  So, the number of nodes in the compressed tree is limited.

•  We can search all non-overlapping node combinations in a brute force way within realistic time and resources.


[Automatically]:  C.  Estan,  S.  Savage  and  G.  Varghese,  “Automatically  Inferring  Patterns 
of  Resource  Consumption  in  Network  Traffic,”  SIGCOMM,  August  2003. 


Copyright  NTT  Information  Sharing  Platform  Laboratories,  2008 




YAF 

A  Case  Study  in  Flow  Meter  Design 


presented  at 

FloCon  2008  -  Savannah,  Georgia 


Software Engineering Institute


Brian  Trammell 

Technical  Lead,  Engineering 
CERT  Network  Situational  Awareness 


Carnegie  Mellon 


©  2008  Carnegie  Mellon  University 


YAF 


Open-source,  IPFIX-compliant  bidirectional  flow  meter 

•  Available  from  http://tools.netsa.cert.org 

Processes  packets  from  multiple  inputs 

•  libpcap  dumpfiles  (ad-hoc  packet  analysis) 

•  libpcap  live  capture  (including  proprietary  pcap  interfaces,  e.g.  Bivio) 

•  Endace  DAG  live  capture 

Performance  is  network  hardware  and  I/O  bound... 

•  ...easily handles OC3, OC12, GigE at line speed, but

•  10GigE requires proprietary hardware at saturation.




Flow  Meter  Design 


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Flow  Meter  Effects  on  Flow  Data 


Fragmentation 
End  Conditions 
Timeouts 
Delta  Counters 
Biflows 

The  Packet  Clock 


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Fragmentation 

Three  approaches  for  flowing  fragmented  traffic: 

•  pretend  there’s  no  such  thing  as  fragmentation, 

•  drop  all  fragmented  packets,  or 

•  full  or  partial  fragment  reassembly 

Each  approach  has  tradeoffs,  and  is  applicable  in 
certain  situations. 

YAF  supports  partial  reassembly. 


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Fragmentation? 


Easiest  way  to  handle  fragmentation:  don’t. 

Leads  to  inaccurate  flow  data  as  subsequent  fragment  port 
numbers  are  incorrectly  decoded: 


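[Figure: two fragmented packets, A and B, between X and Y. Only the first fragment of each carries the transport ports:]

ip hdr A1 (X,Y) | sp X | dp Y | payload A1
ip hdr A2 (X,Y) | payload A2
ip hdr B1 (X,Y) | sp X | dp Y | payload B1
ip hdr B2 (X,Y) | payload B2

Resulting flow records (the "ports" of A2 and B2 are decoded from payload bytes):

sip     | dip     | proto | sp   | dp   | pkts
X.X.X.X | Y.Y.Y.Y | 6     | X    | Y    | 2
X.X.X.X | Y.Y.Y.Y | 6     | A2_0 | A2_2 | 1
X.X.X.X | Y.Y.Y.Y | 6     | B2_0 | B2_2 | 1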
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Fragmentation?  (2) 


Often  used  in  resource-restricted  environments  (e.g., 
routers). 

•  Much  faster:  no  requirement  even  to  recognize 
fragmented  packets. 

•  Much  less  memory  consumption:  no  fragment  table. 

•  Less  susceptible  to  resource  exhaustion  attacks. 

Trivially  easy  to  implement. 

Difficult  or  impossible  to  recover  actual  flows  from 
random  fragment  offset  port  data. 
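
The effect of ignoring fragmentation can be reproduced with a small sketch. It is not YAF code: `ipv4_packet` is a hypothetical helper that builds a bare IPv4 header (checksum left at zero), and `naive_ports` decodes the four bytes after the IP header as ports without looking at the fragment offset.

```python
import struct


def ipv4_packet(src, dst, ident, frag_offset_units, more_frags, payload, proto=6):
    """Build a minimal IPv4 packet (no options; checksum left at 0 for brevity)."""
    flags_frag = (0x2000 if more_frags else 0) | frag_offset_units
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(payload), ident,
                         flags_frag, 64, proto, 0, bytes(src), bytes(dst))
    return header + payload


def naive_ports(packet):
    """Decode 'ports' from the bytes after the IP header, ignoring fragmentation."""
    ihl = (packet[0] & 0x0F) * 4
    return struct.unpack("!HH", packet[ihl:ihl + 4])


src, dst = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
first = ipv4_packet(src, dst, 7, 0, True, struct.pack("!HH", 1234, 80) + b"data")
second = ipv4_packet(src, dst, 7, 1, False, b"\xde\xad\xbe\xefmore")
```

The first fragment yields the real ports, while the second yields whatever payload bytes happen to sit at that position, so the two fragments land in different flows.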


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Dropping  fragmented  packets 


Requires  minimal  resources  at  flow  meter: 

•  need  to  recognize  fragments,  but  not  store  them. 

Leads  to  meter  blindness: 

•  all  an  attacker  must  do  to  hide  from  the  measurement 
infrastructure  is  fragment  all  packets. 

Only  applicable  behind  perimeter  devices  which  also 
drop  all  fragmented  packets. 


Resulting flow records:

  sip   dip   proto   sp   dp   pkts
  [no flows]
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Partial  fragment  reassembly 


Associate  each  fragmented  packet  with  its  actual 
transport  ports: 


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Partial  fragment  reassembly  (2) 

Accurately  assigns  fragments  to  respective  flows. 
Requires  additional  resources  at  flow  meter: 

•  need  to  recognize,  look  up,  and  store  every  fragment. 

More  difficult  to  implement  and  maintain. 

Requires  care  to  avoid  vulnerability  to  resource 
exhaustion  attacks. 
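
A minimal sketch of how partial reassembly could work, assuming a meter that only needs to recover ports (not payload). The `Packet` and `FragmentTable` names, the table size, and the timeout are illustrative assumptions and are not taken from YAF's implementation. The table remembers the ports seen in the first fragment of each datagram and hands them to later fragments; a size cap and a timeout bound the memory an attacker can consume.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Packet:
    sip: str
    dip: str
    proto: int
    ip_id: int                    # IP identification, shared by all fragments
    frag_offset: int              # 0 for the first fragment
    more_frags: bool
    sport: Optional[int] = None   # only decodable when frag_offset == 0
    dport: Optional[int] = None
    ts: float = 0.0               # packet timestamp (packet clock)


class FragmentTable:
    """Hypothetical port lookup table for partial fragment reassembly."""

    def __init__(self, max_entries: int = 1024, timeout: float = 30.0):
        self.max_entries = max_entries
        self.timeout = timeout
        self._ports: Dict[Tuple, Tuple[int, int, float]] = {}

    def ports_for(self, pkt: Packet) -> Optional[Tuple[int, int]]:
        """Return (sport, dport) for pkt, or None if they are not known."""
        if pkt.frag_offset == 0 and not pkt.more_frags:
            return (pkt.sport, pkt.dport)          # not fragmented at all
        self._expire(pkt.ts)
        key = (pkt.sip, pkt.dip, pkt.proto, pkt.ip_id)
        if pkt.frag_offset == 0:
            if len(self._ports) >= self.max_entries:
                # Bound memory: evict the oldest entry when the table is full.
                oldest = min(self._ports, key=lambda k: self._ports[k][2])
                del self._ports[oldest]
            self._ports[key] = (pkt.sport, pkt.dport, pkt.ts)
            return (pkt.sport, pkt.dport)
        entry = self._ports.get(key)
        return None if entry is None else entry[:2]

    def _expire(self, now: float) -> None:
        stale = [k for k, v in self._ports.items() if now - v[2] > self.timeout]
        for k in stale:
            del self._ports[k]
```

This sketch does not handle a non-first fragment that arrives before the first fragment; a fuller version would have to buffer such fragments, which is where most of the extra state comes from.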




Flow  End  Conditions 

Flow  meter  must  recognize  actual  connection 
shutdown... 

•  ...through varying degrees of modeling the host TCP
state machine.

Flows  on  the  wire  are  not  always  so  well-behaved. 
Example:  multiple-RST  teardown. 


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Multiple  RST  teardown 

How  many  flows  here? 


  sip       dip       flags   sp   dp   pkts
  Y.Y.Y.Y   x.x.x.x   SAF     X    Y    6
  Y.Y.Y.Y   x.x.x.x   SAF     Y    X    3
  Y.Y.Y.Y   x.x.x.x   R       Y    X    1
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Multiple  RST  teardown  (2) 


Tempting to group RSTs on teardown into original
flow...

•  ...how long to keep closed flow state?

•  ...how far to take this RST grouping?

•  ...how to communicate new configuration parameters to
analysts?

YAF  stays  predictable,  at  the  expense  of  generating 
multiple  flow  records  for  this  behavior. 


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Passive  Timeouts 


Flows which have no packets over TO_passive seconds
are closed.

Necessary to terminate flows for all non-connection-oriented
transports,

•  i.e.,  anything  but  TCP. 

Longer  passive  timeouts  consolidate  low-frequency 
periodic  activity  into  fewer  flows. 

Shorter  passive  timeouts  reduce  flow  table  resource 
consumption  for  such  activity. 


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Passive  timeouts  (2) 


Generally  chosen  to  match  common  protocol  timeouts... 

•  ...  which  are  generally  round  numbers,  e.g.,  10,  30,  60  sec. 

May  be  chosen  to  avoid  flow  closure  ambiguity  due  to  minor 
variations: 

•  e.g.,  12,  33,  64  sec. 


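[Figure: timeline of packets from A arriving at roughly 10 s intervals (with one 9 s and one 11 s gap), comparing the flow records produced with a 10 s passive timeout (A1, A2) against those produced with a 12 s passive timeout (A').]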
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Active  Timeouts 


Flows which have been open for TO_active seconds are
closed.

•  Maximum flow duration is TO_active seconds.

Necessary  to  ensure  long-lived  flows  are  eventually 
flushed  from  the  flow  table. 

Active  timeout  determines  reporting  delay. 
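
The two timeouts can be sketched as a small flow table, under the assumption that expiry is checked whenever a packet arrives. The `FlowTable` class, its field names, and the default values are hypothetical and are not YAF's code. All time comparisons use the timestamp carried by the packet, so the same capture gives the same flow records whether it is read live or from a file.

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    start: float   # timestamp of the first packet
    last: float    # timestamp of the most recent packet
    pkts: int = 0


class FlowTable:
    """Hypothetical flow table with passive and active timeouts."""

    def __init__(self, passive_timeout: float = 30.0, active_timeout: float = 300.0):
        self.passive_timeout = passive_timeout
        self.active_timeout = active_timeout
        self.flows = {}
        self.exported = []   # (key, start, last, pkts) records

    def add_packet(self, key, ts: float) -> None:
        """Account for one packet; ts is the packet's own timestamp."""
        self.flush(ts)
        flow = self.flows.get(key)
        if flow is None:
            flow = self.flows[key] = FlowState(start=ts, last=ts)
        flow.last = ts
        flow.pkts += 1

    def flush(self, now: float) -> None:
        """Export every flow that is idle too long or open too long."""
        for key, flow in list(self.flows.items()):
            idle = now - flow.last > self.passive_timeout
            too_long = now - flow.start >= self.active_timeout
            if idle or too_long:
                self.exported.append((key, flow.start, flow.last, flow.pkts))
                del self.flows[key]
```

With packets at 0, 10, 20, 31 and 41 seconds, a 10 s passive timeout splits the activity at the 11 s gap into two records, while a 12 s passive timeout keeps it in one.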
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Active  Timeouts  (2) 

Shorter  active  timeouts  used  for  more  rapid  reporting. 
Longer  active  timeouts  used  for  better  data  reduction. 


[Figure: timeline of packets from A at roughly 10 s intervals, comparing the flow records produced with a 30 s active timeout (A1, A2, A3, A4) against the single record produced with a 90 s active timeout (A').]
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Delta  Counters 


Flow  meters  which  periodically  emit  multiple  flow  records  per 
flow  (for  rapid  reporting)  may  use  total  or  delta  counters. 

Total  counters  replace  values  in  previous  flow  records. 

Delta  counters  add  to  values  in  previous  flow  records... 

•  ...thereby  reducing  state  requirements  on  meter  and  increasing 
them  on  collector. 

YAF  uses  total  counters,  but  doesn’t  emit  multiple  records  per 
flow... 

•  . .  .uses  active  timeout  instead. 
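
A small sketch of the collector-side difference between the two counter styles. The function names and the `(pkts, octets)` record layout are assumptions made for illustration. With total counters the collector overwrites its stored value; with delta counters it must keep the running sum itself.

```python
def apply_total(collector: dict, key, pkts: int, octets: int) -> None:
    """Total counters: the new record replaces the previous one."""
    collector[key] = (pkts, octets)


def apply_delta(collector: dict, key, pkts: int, octets: int) -> None:
    """Delta counters: the collector accumulates across records."""
    old_pkts, old_octets = collector.get(key, (0, 0))
    collector[key] = (old_pkts + pkts, old_octets + octets)
```

A meter reporting totals of (10, 1000) and then (25, 2600) conveys the same information as one reporting deltas of (10, 1000) and then (15, 1600); the difference is only in who holds the running state.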
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Biflows 


Representation  of  two  sides  of  a  connection  with  a  single  flow 
record : 

•  Allows  additional  data  reduction 

•  Enables  easier  connection  analysis 

•  Improves  flow  state  modeling  at  flow  meter 

YAF  is  a  biflow  meter,  but  SiLK  stores  uniflows. 


Two uniflow records:

  src (X)   dst (Y)   counters/values
  src (Y)   dst (X)   counters/values

One biflow record:

  src (X)   dst (Y)   fwd counters/values   rev counters/values
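
As an illustration of the data reduction, the sketch below folds uniflow records into biflow records. It assumes the records arrive in time order and that the first direction seen is the forward direction; the dictionary field names are hypothetical and do not correspond to YAF or SiLK record formats.

```python
def to_biflows(uniflows):
    """Merge uniflow dicts (src, dst, sp, dp, proto, pkts) into biflows."""
    biflows = {}
    for f in uniflows:
        fwd_key = (f["src"], f["dst"], f["sp"], f["dp"], f["proto"])
        rev_key = (f["dst"], f["src"], f["dp"], f["sp"], f["proto"])
        if rev_key in biflows:
            # The opposite direction was seen first: count this as reverse.
            biflows[rev_key]["rev_pkts"] += f["pkts"]
        else:
            rec = biflows.setdefault(fwd_key, {"fwd_pkts": 0, "rev_pkts": 0})
            rec["fwd_pkts"] += f["pkts"]
    return biflows
```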
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The  Packet  Clock 


Important  to  drive  all  processes  within  a  flow  meter 
with  a  single  clock 

•  fragment  timeouts,  flow  timeouts,  time  stamping,  etc. 

When  building  a  flow  meter,  gettimeofday(2)  is  not 
your  friend. 

•  often  a  problem  with  porting  host-based  software  into  a 
network-based  monitoring  environment 

Use  the  timestamp  from  the  packet  instead! 

•  ensures that the resulting flow stream is identical whether
captured live or generated from a dumpfile.
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Getting  YAF 

http://tools.netsa.cert.org 

Builds  on  Mac  OS  X,  Linux,  BSD,  Solaris 

•  Bug  reports  from  these  or  other  Unices  welcome! 

Some  prerequisites 

•  glib-2.0  (C  modernization  layer) 

•  libairframe  (application  utility  library  from  NetSA) 

•  libfixbuf  (IPFIX  protocol  implementation  from  NetSA) 

•  libpcap  (generally  available  on  most  modern  Unices) 

•  libdag  (only  required  for  Endace  DAG  capture) 
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Questions? 

Ask  now... 

...or  later: 

•  Brian  Trammell  <bht@cert.org> 

•  Chris  Inacio  <inacio@cert.org> 
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Network  Analysis  of  Point  of 
Sale  System  Compromises 

Operation  Terminal  Guidance 

Chicago  Electronic  &  Financial  Crimes 

Task  Force 
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Background 

Hypothesis 

Deployment  Methodology 
Data  Analysis 
Findings 
Discussion 


Hypothesis:  Remote  attackers  were  not 
targeting  point  of  sale  (POS)  system 
software,  rather  POS  system 
compromises  are  a  result  of  insecure 
deployment  of  the  underlying  operating 
system  by  automated  scanning  and 
vulnerability  exploitation 
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Association  rules 

-  Clustering 

•  T:  Number  of  virtual  POS  systems  with  connection 
attempts  from  a  single  source 

•  n_i: Number of packets from a source to virtual
POS system i

•  N: Total number of packets from a source to all
three POS systems

•  $N = \sum_i n_i$

Support(R) = (# connections to POS systems A, B, and C) / (# connections)


Data analysis methodology from
F. Pouget and M. Dacier, “Honeypot Based Forensics.”
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Control  Group  Clusters 
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Edit  Distance  Analysis 

-  Extract  TCP  payloads 
from  previous  identified 
cluster  members 

-  Compare  packets  from 
each  IP  address 
against  all  others 
identified  through 
clustering 
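
A minimal sketch of the pairwise comparison, assuming a plain Levenshtein (edit) distance over raw payload bytes. The slides do not say which distance variant or tool was used, so the functions below are illustrative only.

```python
from itertools import combinations


def edit_distance(a: bytes, b: bytes) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def pairwise_distances(payloads: dict) -> dict:
    """Compare the payload of every source against every other source."""
    return {(a, b): edit_distance(payloads[a], payloads[b])
            for a, b in combinations(sorted(payloads), 2)}
```

Small distances between sources in one cluster would suggest that they are sending near-identical payloads, for example the same automated exploit.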
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Cluster 

  Cluster      Port   Phrase Distance (Lines)   Std Deviation
  Cluster 2    1026   324                       238
  Cluster 5    1394   360                       85
  Cluster 6    1394   280                       170
  Cluster 7    1394   529                       136
  Cluster 8    1394   1422                      1143
  Cluster 11   5900   240                       257

***Clusters 1, 3, 4, 9, 10 were discarded as not statistically significant


Data Analysis

Network Traffic Overview
POS A1 - Control Group

[Figure: plot across packet header fields: Packet Length, Ethertype, IP Version, IP Header Length, Differential Services, Total Length, IP Fragment, TTL, IP Transport Protocol, IP Header Checksum, IP Source Address, IP Destination Address, TCP Source Port, TCP Destination Port, Seq Number, UDP Source Port, UDP Destination Port.]

Methodology from Greg Conti’s “Security Data Visualization.”

[Figure: panels labeled TCP Source and TCP Destination.]


•  The TCP outlier is
associated with
browsing a public web
site to ensure
connectivity

•  Uniform  length  of 
packets 
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•  Examination  of  the  UDP  packets  identified 
in  the  previous  tree  map  revealed  them  to 
be  spam  targeting  messenger  applications 
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•  Automated  scanning  of  select  set  of  ports 


•  Multiple  exploits  targeting  multiple  OS’s 
from  single  source  IP  address 

•  Attackers  not  aware  compromised  system 
is  a  POS  system  until  after  compromise 
and  exploit 

•  Insecure installation of operating system
and applications leads to compromise
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Presentation  Summary 


This  presentation  will  profile  the  result  of  the  growth  in  peer-to-peer  applications  on  a 
sample  network  and  describe  the  resultant  massive  increase  in  the  diversity  of  traffic.  This 
diversity  impacts  the  ability  to  profile  baseline  normative  behaviour  using  Blind  Flow 
Analysis. 

I  will  also  briefly  discuss  the  application  of  SiLKtools,  Neural  Networks  and  Bioinformatic 
strategies  to  Blind  Flow  Analysis  of  real  world  security  problems  and  how  that  analysis  is 
affected  by  the  growth  in  recreational/user  driven  applications. 

What began as a basic design principle of end-to-end management with popular
applications  in  recreational  computing  is  quickly  becoming  a  dominant  evolutionary  force  in 
network  traffic  patterns. 

Traffic  patterns  are  becoming  emergent  properties  influenced  by  the  voluntary  adoption  of 
new  systems  by  individuals  without  any  collective  intent. 

The  network  is  evolving  at  the  edges. 

“Peer-to-Peer  is  the  basic  design  of  the  Internet”  -  Christian  Huitema 


Sample  Network  Description 


•  A  Multi-tenant  Commercial  Network  consisting  of: 

-  40  user  assigned  hosts,  actual  number  subject  to 
minor  fluctuations  over  time. 

— 40  special  hosts  not  assigned  to  individual  users. 
These  hosts  form  parts  of  various  temporary 
development  and  experimental  environments. 

-  Users  were  apprised  that  Network  flow  data  was  now 
being  captured  for  experimental  and  management 
reasons. 

-  Payload  data  was  neither  collected  nor  examined. 

-  Analysts  did  not  have  access  to  the  content  of  specific 
hosts  for  further  investigation. 

-  For  confidentiality  reasons  the  identity  of  the  Network  is 
not  specified  in  this  Presentation. 


A  Review  of  Blind  Flow  Analysis 


The  Need  for  Classification  Based  on  Minimal  Information  (the 
extreme  case  in  the  world  of  tomorrow) 

Capturing  and  examining  payload  contents  is  widely  viewed  as  a  potential  violation  of 
privacy  and  placed  in  a  category  similar  to  listening  in  on  a  telephone  call. 

Even  attempts  to  use  information  derived  from  the  payload  (such  as  ngrams)  do  little  to 
alleviate  the  fundamental  concern  of  the  user  surrounding  access  to  the  payload. 

In  multi-tenant  commercial  environments  this  user  concern  may  be  based  in  protection 
of  commercial  confidentiality. 

There  is  less  (although  not  zero)  concern  among  the  user  community  with  regard  to  the 
capture  and  investigation  of  packet  header  data  (some  concern  for  Source  and 
Destination  IP’s  and  MAC’s). 

Therefore,  the  network  analyst  may  be  limited  to  examining  a  severely  reduced  subset  of 
the  packet  header  information  in  an  attempt  to  determine  if  the  system  under  their 
management  (or  monitoring)  is  operating  properly  or  experiencing  anomalous  behavior. 

The  loss  of  access  to  the  originating  address  information  means  that  the  analyst  no 
longer  has  access  to  a  unique  field  in  the  data  that  identifies  the  individual  hosts  in  the 
traffic  (i.e.  they  cannot  tell  one  computer  from  another  by  looking  at  the  remaining  flow 
record  traffic  alone). 

In  such  an  environment,  what  is  required  is  a  method  of  classification  that  relies  on 
minimal  information  and  the  development  of  traffic  flow  behaviour  models  that  use  only 
this  information. 


One  Strategy  for  Comparing  A  Suspicious  Host  to 
a  Standard  Workstation  Using  Blind  Flow  Analysis 


                                   Local Baseline Workstation   Suspicious Host
                                   Behaviour (BWB)
  Bytes transferred in one month   < 20 million per month       45 billion per month
  Internal DIPs                    < 10 per month               3 per month
  External DIPs                    < 20 per month               1.74 million per month
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Impact  of  Peering  Traffic  on  Blind  Flow  Analysis 
and  the  Uniqueness  of  Minimal  Information 


•  In early 2006 a Neural Network was used to classify
workstation traffic based on a localized “Workstation
Genome”.

•  It was found that workstation behaviour could be fully
described by a set of 23 unique 3-tuples formed by the
combination of Protocol, Destination Port, and Byte
Range ID, where Byte Range ID was one of five levels
given by:

  Range   Bytes
  1       0-100
  2       100-999
  3       1000-9,999
  4       10,000-49,999
  5       50,000 +


Impact  of  Peering  Traffic  on  Blind  Flow  Analysis 
and  the  Uniqueness  of  Minimal  Information 


50  Hidden 
Nodes 


Each  input  frequency  vector  contains  an  observed  frequency  for  each  3-tuple 
for  a  24  hour  period. 

Each  3-tuple  is  defined  as  Protocol,  Destination  Port,  Byte  Range. 

All  observed  Workstations  could  be  described  by  a  23  element  Vector. 
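
A sketch of how such an input vector could be built from flow records. The function names are hypothetical, raw counts are used as the "observed frequency", and the boundary at exactly 100 bytes (which the slide lists in two ranges) is assigned to range 2 as an assumption.

```python
from collections import Counter

# (exclusive upper bound in bytes, Byte Range ID); 50,000 and above is range 5.
RANGE_UPPER_BOUNDS = [(100, 1), (1_000, 2), (10_000, 3), (50_000, 4)]


def byte_range_id(nbytes: int) -> int:
    """Map a flow's byte count to one of the five Byte Range IDs."""
    for upper, range_id in RANGE_UPPER_BOUNDS:
        if nbytes < upper:
            return range_id
    return 5


def frequency_vector(flows, genome):
    """Count one day's flows per 3-tuple, in the order given by genome.

    flows:  iterable of (protocol, destination_port, byte_count)
    genome: ordered list of (protocol, destination_port, byte_range_id)
    """
    counts = Counter((proto, dport, byte_range_id(nbytes))
                     for proto, dport, nbytes in flows)
    return [counts[t] for t in genome]
```

Flows whose 3-tuple is not in the genome are simply not represented in the vector, which is one way the growth in tuple diversity described later would degrade such a classifier.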


Impact  of  Peering  Traffic  on  Blind  Flow  Analysis 
and  the  Uniqueness  of  Minimal  Information 


  Host ID (target)   Day   Output Vector      Classification (Hit/Miss/Unknown)
  1 [0 1 0]          1     [0.04 0.86 0.08]   HIT
                     2     [0.17 0.97 0.00]   HIT
                     3     [0.10 0.91 0.02]   HIT
                     4     [0.09 0.95 0.01]   HIT
  2 [1 0 0]          1     [0.95 0.06 0.00]   HIT
                     2     [0.96 0.04 0.00]   HIT
                     3     [0.95 0.06 0.00]   HIT
                     4     [0.95 0.07 0.00]   HIT
  3 [0 0 1]          1     [0.00 0.09 0.92]   HIT
                     2     [0.00 0.00 0.99]   HIT
                     3     [0.00 0.12 0.92]   HIT
                     4     [0.00 0.00 0.99]   HIT

100%  Success  rate  on  uniquely  classifying  a  small  sample  of  the  population 


Impact  of  Peering  Traffic  on  Blind  Flow  Analysis 
and  the  Uniqueness  of  Minimal  Information 


•  In  early  2007  a  similar  population  of  workstations  was  chosen  with 
the  goal  of  testing  a  Support  Vector  Machine  approach  to 
classification. 

•  To  the  great  surprise  of  the  author,  the  number  of  unique 
3-tuples  required  to  uniquely  describe  the  Workstation 
Genome had risen from 23 to over 600 in 16 months.


•  Subsequent  investigation  showed  that  the  diversity  of  the  observed 
behaviour  increased  as  a  function  of  both  population  size  as  well  as 
the  length  of  the  sampling  period. 


Impact  of  Peering  Traffic  on  Blind  Flow  Analysis 
and  the  Uniqueness  of  Minimal  Information 


By  limiting  the  traffic  to  ICMP  and  TCP  flow  records,  the  number  of  unique  tuples  required  to 
adequately  describe  the  population  reached  a  steady  state  of  approximately  18%  of  the  total 
number  of  all  expressed  tuples. 

When  UDP  traffic  was  introduced  into  the  sample,  the  percentage  of  unique  tuples  in  the 
population  did  not  reach  a  steady  state  in  proportionality  but  rather  the  number  of  the  unique 
tuples  increased  in  linear  proportion  to  the  number  of  total  tuples  observed. 


Impact  of  Peering  Traffic  on  Blind  Flow  Analysis 
and  the  Uniqueness  of  Minimal  Information 


•  What  happened  to  the  network  traffic  to  create  such  diversity  in  such  a 
short  period  of  time? 

•  Expected  monthly  unique  destination  IPs  =1200  (40  hosts  *  30 
external  and  internal  DIP  contacts). 

Actual  values: 

Average  monthly  destination  IPs  =  140,000 
Average  monthly  number  of  flows  =  2.8  million 
Average  monthly  byte  volume  of  approximately  31  billion 

•  In  addition  to  unusual  volumes,  two  fundamental  behaviours  changed. 

-  Protocol  Ratio 

•  From  TCP  70%  UDP  30% 

•  To  TCP  50%  UDP  50% 

-  Use  of  Unique  Destination  Ports  by  Workstations  now  parallels  Server 
behaviour. 


One  Year  of  Peer-to-Peer 


Much  has  been  written  lately  of  the  growth  and  deployment  of  Peer-to-Peer 
Protocols 

Recommended reading: “Transport Layer Identification of P2P Traffic”,
Thomas Karagiannis, et al., IMC ’04, 2004, Taormina, Italy.

Perhaps  Peer-to-Peer  is  the  culprit. 

Decided  to  check  for  the  presence  of  known  P2P  in  the  traffic 

eDonkey2000 

Fasttrack 

BitTorrent

Gnutella 

MP2P 


One  Year  of  Peer-to-Peer 


Protocol  Flows  By  Month  (nw) 


Month 


The  graph  above  shows  the  pattern  of  flows  by  protocol  for  one  year  for  the 
Target  network. 




One  Year  of  Peer-to-Peer 


Destination  IPS  per  Month 


Months 


DIP'S  per  month 


For  a  small  network  they  talked  to  quite  a  few  friends. 


One  Year  of  Peer-to-Peer 


The  feeling  was  mutual. 


One  Year  of  Peer-to-Peer 


Let’s  consider  the  traffic  contribution  for  each  P2P  Application  in  the  table. 


One  Year  of  Peer-to-Peer 


MP2P, or Manolito, is a P2P system primarily
used to share music files. MP2P traffic was the
smallest contributor to the overall network traffic
among the observed systems. This traffic
reached a peak flow count of just under 160 in
January 2007.


One  Year  of  Peer-to-Peer 


The  Fasttrack  P2P  system  is  primarily  used  by 
Kazaa  and  its  variants  to  exchange  mp3  music 
files.  Fasttrack  traffic  reached  a  peak  flow 
count  of  2,500  in  July  2006. 


One  Year  of  Peer-to-Peer 


eDonkey2000  Flows 
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EDonkey2000  was  a  peer-to-peer 
system  primarily  used  to  distribute 
large  images,  video  games  and 
software.  Although  officially 
discontinued  in  September  2005  due 
to  legal  action  brought  by  the 
Recording  Industry  Association  of 
America  (RIAA),  we  speculate,  based 
on  our  profiling,  that  we  observed 
eDonkey2000  communication  during 
2006.  EDonkey  traffic  passed  25,000 
flows  in  July  2006. 


One  Year  of  Peer-to-Peer 


Gnutella  is  a  multi-tier  Peer  based  file 
exchange  system.  Traffic  from 
Gnutella  ranged  from  5,000  to  35,000 
flows  per  month. 


Gnutella  Flows 
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Gnutella  Flows 


One  Year  of  Peer-to-Peer 
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One  Year  of  Peer-to-Peer 


Unfortunately the overall Peer-to-Peer flow pattern did not match
the pattern that we were seeking, namely a 50/50 ratio of TCP
to UDP.


Protocol  Flows  By  Month  (nw) 


Month 
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One  Year  of  Peer-to-Peer 


The  graph  above  shows  the  pattern  for  which  we  were  searching.  This 
is  the  traffic  from  a  single  user  workstation,  with  a  peak  flow  count  of 
50,000  flows  per  month. 


One  Year  of  Peer-to-Peer 


One  Year  of  Peer-to-Peer 


Destination  IP's 


-♦ —  DIP'S 


Months 


This workstation changed its behaviour in late fall 2006 from talking to
fewer than 100 DIPs per month to 6,000 DIPs per month.


One  Year  of  Peer-to-Peer 


Flows  by  Protocol 


Months 
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Destination  IP's 


Who  am  I  ? 


One  Year  of  Peer-to-Peer 


SKYPE 


This traffic pattern is driven by the adoption of VoIP by a single user in the
target network.


Disclaimer:  It  is  important  to  point  out  that  since  the  experimenter  had  no 
access  to  the  actual  machine  or  payload  data  this  conclusion  is  simply 
conjecture  based  on  known  user  Behaviour  within  the  target  network. 
(Skype  is  a  wonderful  App) 


Observations  on  Traffic  for 
Clients  and  Peers 


•  Consumes  considerable  Resources. 

•  Represents  an  Application  Level  WAN  Network 
for  Communication. 

•  Provides  a  channel  to  hide  Malicious  Activity. 


“McAfee suggested hackers were likely to create malicious software to target
instant messaging services, Voice over Internet Protocol (VoIP) telephony
services and online gaming sites.” Hackers will target social networking sites:
security firms - Thursday, November 29, 2007, CBC News http://www.cbc.ca


Evidence  that  all  is  not  as  it 

Appears 


•  One  day  in  February  a  conversation  took 
place  between  a  user  host  on  the  Network 
and  a  host  compromised  by  an  on-line 
game  server. 

•  Two  hours  later  the  user  host  was 
attempting  to  contact  a  few  friends.... 
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We  Need  to  Re-Consider  our 
Willingness  to  be  a  Peer 


•  Users  willingly  download  and  install 
client/peer/server  software. 

•  They even participate in strategies to avoid
barriers and impediments (like NAT’ing).

•  There  is  an  implied  trust  that  the  communication 
is  exclusively  what  it  claims  to  be. 

•  “When  they  thought  they  were  playing  at  war 
craft,  they  were  actually  playing  at  war  craft.” 


Concluding  Notes 


•  The  network  is  evolving  at  the  edges 

•  This means that network architectures,
management and provisioning strategies
are now more responsive than ever.

•  Global  communication  resources  are 
primarily  influenced  by  the  uncoordinated 
activities  of  individuals. 

•  Traffic  patterns  are  emergent  properties 
without  intent. 


Future  Work 


•  Study  the  growth  in  diversity  of  patterns  in  traffic. 

•  Study  the  form  and  distribution  of  applications  and  participants. 

•  Track  Unidentified  Anomalies. 

•  February  2008,  TARA  will  announce  the  InTARA  project 

Intelligent  Network  Traffic  Analyzers  for  Reconstructive  and  Real  Time  Analysis 


•  InTARA  will  be  a  multi-million  dollar,  multi-year  project  to  develop 
intelligent  traffic  analysis  capabilities  for  the  good  guys. 

•  We  are  seeking  global  collaborative  research  and  commercialization 
partners.  Early  stage  interest  from  Australia,  India,  Switzerland, 
Canada. 
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Introduction 


Initial  Activity  in  many  intrusions 

-  Scanning 

Techniques  to  detect  these  initial  scans 

One  of  the  effective  algorithms 
-Threshold  Random  Walk 
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Introduction  (contd.) 


Challenges  when  using  TRW 

-  UDP  and  ICMP  Traffic 

-  Repetitive  Scanning 

-  Slow  and  Stealthy  Scans 

Using  Bloom  filters 

-  eliminate  repetitive  input  to  TRW 

-  look  for  reverse  matches  in  time  ordered 
data 
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Threshold  Random  Walk 


Scan  Detection  Algorithm  based  on 
sequential  hypothesis  testing. 


Uses  a  positive  reward  based  scan  detection. 

-  For  a  given  host,  records  connection  attempt 


made 
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Threshold  Random  Walk 


The likelihood ratio is calculated as:

$$\Lambda(Y) = \frac{\Pr[Y \mid H_1]}{\Pr[Y \mid H_0]} = \prod_{i=1}^{n} \frac{\Pr[Y_i \mid H_1]}{\Pr[Y_i \mid H_0]}$$

Where the probabilities are:

$$\Pr[Y_i = 0 \mid H_0] = \theta_0, \quad \Pr[Y_i = 1 \mid H_0] = 1 - \theta_0$$

$$\Pr[Y_i = 0 \mid H_1] = \theta_1, \quad \Pr[Y_i = 1 \mid H_1] = 1 - \theta_1$$

-  $Y_i$ = outcome of the i-th connection attempt: success (0) or failed (1)

-  $H_0$ = benign hypothesis

-  $H_1$ = scanner hypothesis

-  $\theta_0$ = probability that a connection attempt succeeds, given that
the source is benign

-  $\theta_1$ = probability that a connection attempt succeeds, given that
the source is a scanner
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Threshold  Random  Walk 


The  thresholds  are  calculated  based  on 

-  desired true positive rate ($\beta = 0.99$)

-  desired false positive rate ($\alpha = 0.01$)
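
A sketch of the sequential test, assuming the usual threshold approximations from sequential hypothesis testing (upper threshold beta/alpha, lower threshold (1 - beta)/(1 - alpha)); these formulas are not stated on the slides. The function name and the default values of `theta0` and `theta1` are illustrative, and the computation is done in log space to avoid underflow over long sequences.

```python
import math


def trw_classify(outcomes, theta0=0.8, theta1=0.2, alpha=0.01, beta=0.99):
    """Walk the likelihood ratio over outcomes (0 = success, 1 = failure).

    Returns "scanner", "benign", or "can't say" if neither threshold is hit.
    """
    upper = math.log(beta / alpha)               # accept scanner hypothesis
    lower = math.log((1 - beta) / (1 - alpha))   # accept benign hypothesis
    log_ratio = 0.0
    for y in outcomes:
        if y == 0:
            log_ratio += math.log(theta1 / theta0)
        else:
            log_ratio += math.log((1 - theta1) / (1 - theta0))
        if log_ratio >= upper:
            return "scanner"
        if log_ratio <= lower:
            return "benign"
    return "can't say"
```

With these illustrative parameters each failure multiplies the ratio by 4 and each success divides it by 4, so four consecutive failures cross the upper threshold (256 >= 99) and four consecutive successes cross the lower one.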
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Bloom  Filter 


It’s  a  Data  Structure 

-  test  the  membership  of  an 
element  for  a  given  set 

Definition  of  the  Structure 

-  bit  array  of  m  bits 

-  k  different  hash  functions 

-  Each hash function maps a key
value to one of the m array
positions.
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Bloom  Filter 


Properties  : 

-  False  positives  possible 

-  No  false  negatives 

-  Elements  can  be  added 

-  No  deletion  possible 

-  Greater  the  number  of  elements,  higher  the 
probability  of  false  positives. 

-  Space  Efficient 

-  Cannot  determine  the  elements  present  in  it. 
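
A minimal Bloom filter sketch showing these properties. Deriving the k positions from one SHA-256 digest by double hashing is an implementation choice made here for brevity, not something taken from the slides, and the default sizes are illustrative.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits with k hash positions per item."""

    def __init__(self, m_bits: int = 1 << 20, k_hashes: int = 10):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: position_i = h1 + i * h2 (mod m).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))
```

To drop repeated input, a caller would test a key such as `b"sip|dip"` for membership before passing the record on, and add it afterwards. There is no `remove` method because clearing a bit could erase evidence of other items that share it.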
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Modified  TRW  with  Bloom  Filter 


raw  hit  or  miss  definition 

-  For  a  given  pair  in  the  flow  record 
eg  {sip,  dip} 

•  HIT  =  if  a  corresponding  entry  {dip,  sip,  sport, 
dport,  proto}  is  found  within  a  specified  timeout 
period 

•  MISS  =  if  a  corresponding  entry  {dip,  sip,  sport, 
dport,  proto}  is  not  found  within  a  specified 
timeout  period 
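
The hit/miss rule can be sketched over time-ordered flow records as follows. A plain dictionary stands in for the Bloom filter here so that the timeout can be checked exactly; the function name, tuple layout, and default timeout are assumptions for illustration.

```python
def label_hits(flows, timeout: float = 60.0):
    """Label initiating flows as HIT or MISS.

    flows: list of (ts, sip, dip, sport, dport, proto), sorted by ts.
    Returns {index of initiating flow: "HIT" or "MISS"}.
    """
    pending = {}   # forward 5-tuple -> (index, timestamp)
    labels = {}
    for i, (ts, sip, dip, sport, dport, proto) in enumerate(flows):
        reverse_of = (dip, sip, dport, sport, proto)
        earlier = pending.get(reverse_of)
        if earlier is not None and ts - earlier[1] <= timeout:
            # This record answers an earlier flow within the timeout.
            labels[earlier[0]] = "HIT"
            del pending[reverse_of]
            continue
        pending[(sip, dip, sport, dport, proto)] = (i, ts)
        labels[i] = "MISS"   # stays MISS unless a reverse record arrives
    return labels
```

The resulting labels can then be fed to the scan detector as success (HIT) and failure (MISS) outcomes per source.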
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Modified  TRW  with  Bloom  Filter 


Bloom Filter uses 10 hash functions and a bit
vector of size 2^32

Experiment  Set  up  : 

-  Pass  the  flow  records  through  the  bloom  filter. 

-  Specify  selection  criteria:  {sip,  dip},  {sip,  dip, 
proto},  {sip,  dip,  sport},  {sip,  dip,  dport},  {sip,  dip, 
sport,  dport,  proto} 

-  Use  the  TRW  scanning  algorithm. 
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Modified  TRW  with  Bloom  Filter 


Specify Unique Criteria:
SP or SDP or SDSP or
SDDP or SDSDP

Flow Records -> Bloom Filter -> Unique Entries -> Modified TRW
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The  Dataset 


A  year  long  trace  collected  on  a  / 22 
enterprise  network 

Using  Silk  Tools 

Internal  Network  Hosts 

-Total  Address  Space  =  1024 

-  #Active  hosts  in  a  given  day  =  varies 
between  60-70 

-  Active  Address  Space  ~  6% 
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The  Dataset 


OutIPs Seen

  Month   EtoO       OtoE      Non Responsive Out IPs   % Non Responsive Out IPs
  Feb     26680      7270      19410                    72.75112444
  Mar     30232      3866      26366                    87.21222546
  Apr     56126      14576     41550                    74.02986138
  May     2355612    106893    2248719                  95.46219836
  June    2847371    283270    2564101                  90.05152472
  July    2601834    246312    2355522                  90.53313932
  Aug     30181      29097     1084                     3.591663629
  Sept    126913     126549    364                      0.28681065
  Oct     330740     277438    53302                    16.11598234
  Nov     4050       2932      1118                     27.60493827
  Dec     2226535    254484    1972051                  88.57040199
  Total   10636274   1352687   9283587                  87.28232274
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Number 
o-l  IPs 


The  Dataset 
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The  Dataset 
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Problems  faced  during  Analysis 


Time  granularity 

-  millisecond  not  available. 

-  Flow records within the same second are ordered
with outside-to-inside records first.

Background  noise  in  the  traffic. 

ICMP  ping  traffic  causes  false  detection. 
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Problems  faced  during  Analysis 


[Figure: TRW likelihood ratio on a log scale, showing the region between the two thresholds.]
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0.01 
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Preliminary  Results 


TRW  Parameters  used: 

-  Theta1 determined based on the % of active internal
hosts compared to the total address space ~
0.0654

-  Theta0 ~ 0.8

•  Changed theta0 for benign hosts to hits / (hits + misses)

•  The value of new theta0 ranged from 0.45 to 1.00

•  All benign hosts still classified as benign

-  Alpha (desired false positive) = 0.01

-  Beta (desired true positive) = 0.99
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Preliminary  Results 


Flows  per  Month 
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Preliminary  Results 


Scanner  Detected 
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Likelihood  Ratio  LogScale 


Preliminary  Results 
Plot of Likelihood ratio for Scanners


Time 
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Likelihood  Ratio  LogScale 


Preliminary  Results 
Plot of Likelihood ratio for Can’t Says
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Preliminary  Results 
Plot of Likelihood ratio for Benign


[Figure: likelihood ratio (log scale, 1e-06 to 100) versus time for benign hosts.]


DALHOUSIE 

UNIVERSITY 


Inspiring  Minds 


Initial  Conclusions 


Using a Bloom filter reduces the false
positives (by how much?)

-  unique entries considered for a given filter
criteria

Using specific filter criteria for the Bloom
filter

-  detects vertical scanning
-  detects horizontal scanning
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Further  Work  In  Progress 

Need to improve the technique:

-  Vary theta0 and theta1 values

-  Effect of timeout period

-  Real time scenario

Long term analysis of IPs toggling
between the three regions

-  Esp. from scanning to Can’t say or benign
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Thank  you 


Questions  ? 
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Project  Objectives 


Analysis of Wireless Network Data from Dartmouth
College (CRAWDAD Archive)

Adding  MAC  Layer  information  in  Net  Flow  tools  for 
identification  of  nodes  and  Activities  performed  by  a  node. 

Return  converted  flow  data  to  the  Crawdad  archive. 
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Proj  ect  Rationale 


The  main  issue  in  analyzing  wireless  network  data  from 
many  environments  is  the  assignment  of  temporary  IP 
Addresses  using  DHCP  with  short  leases. 

The total user population often exceeds the available address
space, and a given user may connect to the network for short
sessions from a number of different locations,
complicating per-platform analyses.

Work  to  date  has  concentrated  on  mobility  rather  than 
platform  behaviour. 
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The  Data 


160  GB  of  compressed  tcpdump  packet  headers. 

Collected  continuously  from  2  Nov  04  -  28  Feb  04 

18 collection points: academic, library, residence

Nothing beyond IP headers except TCP ports and flags, and
UDP ports.

Anonymized with a prefix-preserving technique

- Usage agreement precludes attacking the anonymization to determine
user identity.

- Low-order 24 bits of MAC also anonymized

- List of known wireless MAC addresses provided
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Technical  Approach  -  1 


Tried to use the vlan tag fields to avoid altering the YAF
record format.

Use the forward and reverse vlan tag fields to get the
source and destination MAC addresses into the yafscii output.

Since these fields are 16 bits wide, use a perfect hash of the MAC.

Problems:

- The vlan tag is in the unidirectional extension of the flow. We need
both, even for unidirectional flows.

- We would like to use this in real time, and when the MAC set
is not completely known.
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Technical  Approach 

We  added  MAC  to  the  bidirectional  flow  root  in  yaf,  with 
both  source  and  destination  MAC  addresses. 

There are a number of subtleties here, including the use of
memcpy, which introduces field-order dependencies (an
IPv4 optimization), and the assumption that the MAC flag
implies a non-zero vlanid.

Once the MAC addresses were in the yafscii output, we
started converting it into SiLK format for further data analysis.

Shortly after we finished, CERT added MAC address
support to YAF, and we will use it in the future.
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Technical  Approach 


We created a module, yafscii2tuc.c

- Inserts the minimal perfect hash index of each MAC into the in / out fields

- Adds a sensor id from the command line to identify the sniffers.

We split the output of yafscii2tuc into separate hourly
streams and use popen to send each one to a separate
invocation of rwtuc so that the resulting files are in a
proper date hierarchy.

We also use rwsort on the rwtuc output to ensure time
order and because rwtuc does not compress.
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Minimal  perfect  hashes 


A  Minimal  Perfect  Hash  maps  a  set  of  N  unique  strings  into 
integers  in  [0...N-1] 

-  Packages  available  on  internet  designed  for  null  terminated  strings 

-  Modified  for  counted  strings 

-  Extracted  all  MACS  from  Dartmouth  packet  data 

-  Grouped  to  bring  common  usages  together,  e.g.  known  wireless, 
gateways,  etc.  then  created  MPH 

- 17,000+ MACs, 11,000+ with IP packets.

Lookup  is  constant  time,  collision  free 
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MAC  types 

There are 5 categories of MACs actively involved

- Known wireless MACs with IP traffic

- Other MACs with IP packets

- Multicast MACs

- Gateway MACs

- Broadcast MACs

A large number of MACs have no IP traffic

- Some appear only at the link layer; others are in the MAC list but were
not seen

We  used  rwfilter  to  build  sets  for  each  type  of  MAC  address 
based  on  the  input  and  output  field  values 
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Project  Outcomes 


We found some interesting information during analysis
of the datasets. Some traces show the same IP
addresses appearing at two different sniffers in
different locations.

The reason may be the physical placement of the sniffers
collecting the data. If the sniffers were not located
far enough from each other, the same IP traffic
could have been captured by two different
sniffers.

This seems improbable and needs further study.
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Remaining  problems 


yaf  does  not  deal  with  decreasing  time  well 

-  In  live  capture,  packets  are  always  in  increasing  time  order  no 
matter  what  the  clock  says 

-  In  playback  the  same  holds  unless  the  file  has  been  reordered. 

-  Several  Dartmouth  sensors  exhibit  decreasing  time,  probably  due 
to  ntp  or  other  clock  adjustments. 

Data  from  one  of  the  sensors  “breaks”  the  pipe 

-  This  may  be  related  to  the  time  problem  above  or  may  be  due  to 
another  problem 

-  Truncated  packets  may  lead  to  other  pathologies  in  yaf 
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Next  steps 


We  want  to  reassign  the  IPs  currently  used  to  a  consistent  IP 
that  is  related  to  the  MAC  index. 

First  we  need  to  determine  if  any  wireless  IPs  are  associated 
with  gateway  MACs. 

-  This  would  occur  if  a  wireless  unit  talked  to  another  wireless  unit 
via  a  routed  connection,  e.g.  units  connecting  via  separate  sniffers. 

-  Start  by  creating  sets  for  each  MAC  type  and  looking  for 
intersections 

-  May  have  to  explore  DHCP  strategy  in  more  detail. 

This  is  currently  underway. 
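
The intersection check described above can be sketched with plain Python sets standing in for the SiLK IP sets built with rwfilter. The record layout and the category names are assumptions made for illustration only.

```python
def wireless_ips_behind_gateways(records, mac_type):
    """Return IPs seen with both a known-wireless MAC and a gateway MAC.

    records:  iterable of (ip, mac_index) pairs taken from the flow data.
    mac_type: dict mapping a MAC index to a category string.
    """
    ips_by_type = {}
    for ip, mac_index in records:
        ips_by_type.setdefault(mac_type[mac_index], set()).add(ip)
    wireless = ips_by_type.get("wireless", set())
    gateway = ips_by_type.get("gateway", set())
    return wireless & gateway
```

A non-empty result would mean that some wireless addresses were also observed behind a gateway MAC, which is the routed wireless-to-wireless case that has to be resolved before IPs can be reassigned by MAC index.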
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Next  Steps 


The technique we used for this research should prove useful
for similar data from wireless “hot spots”, airport, hotel, and
convention center networks, and more.

The same approach can be used to analyze data by using MAC-layer
information in flow analysis tools to identify the activities and
movements of nodes in wireless networks.
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Outline

■  Introduction 

□  Why  do  we  need  a  large-scale  collection  system? 

□  What  is  Flow  Mediator? 

■  Requirements 

□  I  tried  to  explore  the  possibility  of  a  large-scale 
collection  system  for  large  networks. 

■  Heuristic  method  of  designing  traffic  collection 
system 

□  Estimate  number  of  flow  records  after 
aggregation  or  sampling 

□  Adjust  several  parameters  based  on  this  result 

■  Summary 
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Introduction


■ Traffic volumes in ISP networks have grown
enormously in the last few years.

The number of exported flow records is becoming so large that
a single collector cannot handle them.

■ A smaller sampling rate makes small flows
invisible.

Even if traffic grows, network operators would like to keep
the same sampling rate as far as possible.

■ Aggregated flow records from routers hide port
numbers and IP addresses.

Exporting 5-tuple flow records from routers is better.


The demand for a large-scale traffic-collection system is
growing.
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What  is  Flow  Mediator? 


■ Flow Mediator† is a device that "mediates" flow
records and has the following functions:

□ collects Flow Records from various exporters

□ stores original flow records

[Figure: Flow Mediator between the network and its consumers; legible labels: Network, Customer Network, Service Operator.]

† draft-kobayashi-ipfix-mediator-model-01.txt
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You  can  easily  make  Flow  Mediation  code 


■ The Net::Flow Perl module is available on CPAN.

□ http://search.cpan.org/~akoba/Net-Flow-0.02/

The module can encode and decode NetFlow/IPFIX packets.

□ The encoding and decoding functions have a similar interface.
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Requirements 


■ Build a traffic-collection system that meets the
following requirements

□  Requirement  1:  measure  traffic  flow  of  entire 
networks 

■  measure  traffic  matrices  PoP  by  PoP  and  router  by 
router 

□  Requirement  2:  store  received  5-tuple  flow 
records  from  router 

■  When  traffic  incident  happens,  allow  inspection  of 
traffic. 

□  Requirement  3:  design  scalable  architecture  to 
accommodate  large  ISP  traffic  volume 
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Goal 


■ Explore a heuristic method of designing a collection
system for introduction into an actual network.

■ The proposed collection system needs to accommodate the
following network model.

□ Total traffic volume: 500 Gb/s, 100 Mp/s

■ Edge routers: 20/PoP × 10 PoPs = 200

■ NetFlow is enabled on the ingress I/F of each edge router.
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Hierarchical  Collection  System 

■  Mediators  are  allocated  in  each  PoP. 


□  They  store  all  flow  records,  aggregate  them, 
and  export  them  to  next  collector. 


Top  Collector 

□  measures  wide-area 
traffic  matrices,  such  as 
router  by  router,  pop  by 
pop. 

Inspection

□  If  traffic  incident 
happens,  we  can  retrieve 
detailed  flow  records 
from  Flow  Mediator. 


Requirement  1 


Requirement  2 


10 PoPs, 20 routers/PoP; mediators are located in each PoP.

[Figure legend: core router, edge router, observation point.]
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Visualize  Traffic  Matrices 


■ The top collector can visualize router/PoP/AS traffic
matrices.

Nail is the name of our
traffic matrix visualizer.

Color indicates the traffic
volume of each source/
destination pair.

We can select all traffic or
a specific VPN (customer).

[Screenshot: "popmatrix (mpls)" view with selectors for router, IP version (v4+v6), and VPN (ALL), showing a source PoP by destination PoP matrix in bps for the period starting 2006/10/13 10:15. Legible cell values range from 16,467 to 212,966; the PoP names are illegible.]
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Heuristic  Design  Method 

■  Suitable  values  of  several  parameters  are  decided 
by  the  following  steps. 

□  Step  0:  measure  performance  limit  of  flow  mediator  and 
top  collector. 

□  Step  1:  reveal  relation  between  number  of  flow  records 
and  packet  sampling 

□  Step  2:  reveal  relation  between  number  of  flow  records 
and  aggregation  that  depends  on  several  factors. 

■  Aggregation  methods  (BGP  Next- Hop,  Prefix,  host) 

■  Aggregation  interval  time  (20  s,  60  s,  90  s...) 

□  Step  3:  select  suitable  value  within  performance  limit. 

■  Large  sampling  rate  is  preferable. 

■  Small  granularity  of  aggregation  is  preferable. 
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Consideration  Points 


■ Several points need to be considered.

□ The maximum performance of the top collector and of each
mediator is 5 kf/s and 10 kf/s, respectively.


10 PoPs, 20 routers/PoP; mediators are located in each PoP.

[Figure legend: core router, edge router, observation point.]
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Step  1:  estimate  flow  records  after  sampling 


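Estimate the number of flow records from the density
function of packets per flow.

□ # of packets per flow: $x$, with density $F(x) = 0.5\,x^{-1.73}$

□ Sampling rate: $1/r$

□ Total number of unsampled flows: $f_{all}$

$$f_{sampled} = \sum_{x=1}^{\infty} \bigl(1 - (1 - 1/r)^{x}\bigr) \times F(x) \times f_{all}$$

The factor $1 - (1 - 1/r)^{x}$ is the extraction probability, that is, the chance that at least one packet of an $x$-packet flow is sampled.

[Figure: density of packets per flow, fitted by $0.5\,x^{-1.73}$.]

Roughly estimate $f_{all}$ as follows.

100 Mp/s ÷ 20 packets per flow = 5 Mf/s

Approximate # of flows when the total traffic volume is 500 Gb/s:

Sampling rate    1/100      1/1000     1/10000
f_sampled        305 kf/s   43 kf/s    5.2 kf/s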
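
The Step 1 estimate can be written as a short numerical sketch. The exponent, the coefficient, and the truncation point of the infinite sum are assumptions here (the exponent is read from the slide as -1.73), so the output should not be expected to match the table above exactly.

```python
def sampled_flow_rate(f_all, r, exponent=1.73, coeff=0.5, max_packets=100_000):
    """Estimate the flows/s that remain visible after 1-in-r packet sampling.

    A flow of x packets is seen with probability 1 - (1 - 1/r)**x, and
    F(x) = coeff * x**-exponent is the assumed share of flows with x packets.
    The infinite sum is truncated at max_packets (an assumption).
    """
    miss = 1.0 - 1.0 / r
    total = 0.0
    for x in range(1, max_packets + 1):
        total += (1.0 - miss ** x) * coeff * x ** -exponent
    return total * f_all


# f_all = 5 Mf/s, following the rough estimate on the slide.
rates = {r: sampled_flow_rate(5e6, r) for r in (100, 1000, 10000)}
```

The heavy tail of F(x) means the result is sensitive to where the sum is cut off, which is one reason a measured density is preferable to a fitted one when sizing a real collector.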
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Too  many  flow  records  without  mediator 


■ Even if the sampling rate is 1/10,000 packets, the
number of flow records exceeds the performance limit.

Sampling rate    1/100      1/1000     1/10000
f_sampled        305 kf/s   43 kf/s    5.2 kf/s

[Figure: with a sampling rate of 1/10000 packets, 5.2 kf/s reach the top collector, whose maximum is 5 kf/s.]

10 PoPs, 20 routers/PoP; mediators are located in each PoP.

[Figure legend: core router, edge router, observation point.]
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Step 2: flow records after aggregation


■ What is the # of flow records after aggregation?

■ The mediator aggregates unsampled flow records at 20-second
intervals.

□ Aggregation efficiency: Prefix > HOST > Pair Prefix > Pair HOST >
Bi-Flow

■ The prefix length "/24" is uniformly applied to Prefix aggregation.

■ Bi-flow is aggregated from the two flow directions.

[Figure: aggregation ratio (= f_e/f_r, from 0 to 1) versus elapsed time (0 to 900 s) for PAIR_HOST, SRC_HOST, DST_HOST, PAIR_PREFIX, SRC_PREFIX, DST_PREFIX, and BIFLOW.]
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Step  2:  Flow  records  after  aggregation,  sampling 


Each aggregation method gradually becomes ineffective.

Bi-flow becomes ineffective immediately.

□ It is sensitive to the sampling rate.

[Figure: aggregation ratio under packet sampling, labelled "Sampling rate 1/1024", for FLOW, PAIR_HOST, IPSRCADDR, IPDSTADDR, PAIR_PREFIX, SRC_PREFIX, DST_PREFIX, and BIFLOW.]

Jan 8, 2008, FloCon 2008


Step  2:  Which  factor  influences  aggregation? 


■  Aggregation  ratio  depends  on  several  factors. 

□  Traffic  Volume  through  observation  point. 

□  Sampling  rate 

□  Aggregation  interval  time 

I guess that the aggregation ratio depends on the
number of flow records received in each interval.

Received flows                   3450    3562
Aggregation interval time (s)    10      300
Sampling rate (1/r)              1/1     1/128
DST HOST aggregation ratio       45%     43%
DST PREFIX aggregation ratio     30%     32%
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Step  2:  Which  factor  influences  aggregation? 


■ I plotted all experimental data on one graph.

□ Three MAWI traffic data samples with different volumes.

□ Aggregation interval time: 5 - 300 s

□ Sampling rate: 1/1 - 1/1024

[Figure: aggregation ratio versus # of flow records.]

The aggregation ratio depends on the number of received flow records.
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Step  2:  Formulation  of  Aggregation  Ratio 


■ The aggregation ratio ($R$) can be estimated from the
number of flow records ($f_r$), as follows.

□ DST Host aggregation: $R_{dsthost} = 1.80 \times f_r^{-0.18}$

□ DST Prefix aggregation: $R_{dstprefix} = 2.34 \times f_r^{-0.26}$

■ After all, the aggregation ratio depends on the #
of unique hosts or prefixes versus the # of flows.

[Figure: log-log plots of # of DST hosts versus # of flow records, and of aggregation ratio (= DST hosts / flows) versus # of flow records.]
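
The fitted ratios can be turned into a small estimator of the load on the top collector. This is a sketch under stated assumptions: $f_r$ in the fit is taken to be the number of records a mediator receives in one aggregation interval, the ratio is capped at 1, and the function and parameter names are hypothetical.

```python
def aggregation_ratio(flows_in_interval, method="dst_host"):
    """Fitted ratio R = f_e / f_r for the records received in one interval."""
    coeff, power = {"dst_host": (1.80, -0.18),
                    "dst_prefix": (2.34, -0.26)}[method]
    return min(1.0, coeff * flows_in_interval ** power)


def top_collector_rate(mediator_rate, interval_s, mediators=10, method="dst_host"):
    """Flow records/s arriving at the top collector from all mediators."""
    ratio = aggregation_ratio(mediator_rate * interval_s, method)
    return mediators * mediator_rate * ratio


# 4.4 kf/s per mediator, DST prefix aggregation, 60 s interval.
rate = top_collector_rate(4400, 60, method="dst_prefix")
```

With these inputs the estimate comes out at roughly 4 kf/s, below the 5 kf/s limit quoted for the top collector.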
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Step  3:  Selection  of  Suitable  Values 


■ I selected suitable values within the performance limits.

# of flow records received by the top collector (= Σ f_e):

Sampling rate                             1/100      1/1000     1/10000
DST_HOST aggregation, interval 60 s       45 kf/s    9.0 kf/s   1.6 kf/s
DST_PREFIX aggregation, interval 60 s     21 kf/s    4.7 kf/s   0.94 kf/s
DST_HOST aggregation, interval 300 s      34 kf/s    7.0 kf/s   1.2 kf/s
DST_PREFIX aggregation, interval 300 s    12 kf/s    3.0 kf/s   0.62 kf/s

# of flow records received by each mediator (f_r):

Sampling rate                             1/100      1/1000     1/10000
f_r                                       30 kf/s    4.4 kf/s   0.6 kf/s
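
The Step 3 selection can be expressed as a filter over the table values. The sketch below assumes the limits of 5 kf/s for the top collector and 10 kf/s per mediator given earlier, and assumes that the 7.0 kf/s entry belongs to DST_HOST aggregation at 300 s and sampling rate 1/1000. The preference order (highest sampling rate, then finest granularity, then shortest interval) is one reading of the design method, not a rule stated on the slide.

```python
TOP_LIMIT = 5.0        # kf/s, top collector
MEDIATOR_LIMIT = 10.0  # kf/s, each mediator

# kf/s received by one mediator, keyed by sampling denominator r (rate 1/r).
MEDIATOR_LOAD = {100: 30.0, 1000: 4.4, 10000: 0.6}

# kf/s received by the top collector: (method, interval_s) -> {r: load}.
TOP_LOAD = {
    ("dst_host", 60): {100: 45.0, 1000: 9.0, 10000: 1.6},
    ("dst_prefix", 60): {100: 21.0, 1000: 4.7, 10000: 0.94},
    ("dst_host", 300): {100: 34.0, 1000: 7.0, 10000: 1.2},  # 7.0 placement assumed
    ("dst_prefix", 300): {100: 12.0, 1000: 3.0, 10000: 0.62},
}


def feasible():
    """Return (r, method, interval_s) choices within both performance limits."""
    return [
        (r, method, interval)
        for (method, interval), loads in TOP_LOAD.items()
        for r, load in loads.items()
        if load <= TOP_LIMIT and MEDIATOR_LOAD[r] <= MEDIATOR_LIMIT
    ]


def preferred(choices):
    """Prefer the highest sampling rate, then host granularity, then the shortest interval."""
    return min(choices, key=lambda c: (c[0], c[1] != "dst_host", c[2]))


best = preferred(feasible())  # (1000, "dst_prefix", 60)
```

Under these assumptions the choice is a sampling rate of 1/1000 with DST prefix aggregation at a 60 s interval.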
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Example  of  collection  system 


■ Sampling rate: 1/1000

■ Aggregation interval time: 60 s

[Figure: example collection system. Mediators in each PoP apply DST Prefix aggregation with an interval time of 60 s; the top collector (max. 5 kf/s) provides the traffic matrix view. A rate of 1.7 kf/s is marked in the diagram. Legend: core router, edge router, NetFlow observation point.]

10 PoPs, 20 routers/PoP; mediators are located in each PoP.
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Conclusion 


■ A flow mediator is an efficient way to build a
large-scale traffic-collection system.

■ Revealed the relation between the number of flow
records and several factors:

□ Traffic volume

□ Sampling rate

□ Aggregation method

□ Aggregation interval time

■ Demonstrated that a traffic-collection
system using mediators can be introduced
into actual large-scale networks.
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Thank  you  for  your  attention 


This study was supported by the Ministry of Internal Affairs and Communications of Japan.
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Abnormal  traffic  detection 

and  alert 


Yiming  Gong 
XO  Communications 

http://security.zz.ha.cn/flocon2008.pdf 

FloCon 2008
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•  XO  network 

- OC-192 IP backbone with OC-12 uplinks in our markets
and data centers, AS 2828


•  Backbone  level  abnormal  traffic  detection 
-  netflow 
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•  Commercial  product  not  good  enough 

-  You  get  what  GUI  gives  you 

-  Very  likely  to  miss  low  volume  traffic  attack 

•  (storm  worm,  scans) 

-  By  default,  alert  based  on  thresholds 

-  Lacking  data  mining  ability 

-  Cost 

•  Free  flow-based  tool 

- Powerful, but you need to tell them what to do
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•  Detect  network  abnormal  traffic 

-  both  low  and  high  volume 

•  Non-threshold  based 

•  Automatically 

•  Fully  controlled  and  customized 

•  Data  mining 

•  Better  be  free 
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Bits  per  Second 


In  a  perfect  world 


•  Smooth  curve,  recurrent  traffic  pattern 


•  Spike  means . ? 
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Our thought


•  Break  down  raw  netflow  records  to 

-  TCP  SYN,  UDP  total,  ICMP  type|code,  protocol  on  each 
IFIndex  of  each  edge  router 

•  Session 

•  Traffic 

•  For  each  element 

-  establish  a  weekly  traffic  profile 

-  Profile  is  a  band 

• Raise an alert when

- real data is higher than the upper tolerance band

- and some other conditions match
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Head  results 


•  $  head  -4  SYN_session_routerx_past_5_minutes 

•  9  3005 

•  75  2844 

•  76  2121 

•  8  2120 

•  $  head  -4  SYN_traffic_routerx_past_5_minutes 

•  9  137792 

•  75  128952 

• 8 101084

•  76  100092 

•  $  head  -4  PROTO_session_routerx_past_5_minutes 

•  6  668344 

•  17  104205 

•  50  22725 

•  1  4517 

•  $  head  -4  ICMP_typecode_session_routerx_past_5_minutes 

•  0  5431 

•  2048  1953 

•  2816  792 

•  771  586 
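
The ICMP_typecode values above look like the usual NetFlow packing of ICMP type and code into one number (type × 256 + code). Assuming that convention holds for this data, a value can be split back into its parts as follows; the function name is illustrative.

```python
def decode_icmp(typecode: int):
    """Split a packed ICMP value (type * 256 + code) into (type, code)."""
    return divmod(typecode, 256)


# Values from the listing above.
decoded = {value: decode_icmp(value) for value in (0, 2048, 2816, 771)}
```

Under this assumption, 2048 is type 8 code 0 (echo request), 0 is echo reply, 2816 is type 11 code 0 (time exceeded), and 771 is type 3 code 3 (port unreachable).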
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Dynamic  profile 


•  example 


[Figure: IR LosAngeles TCP SYN traffic on IFindex 51, showing failures, expected value, and measured SYN traffic. Current: 124242; average: 58951; max: 169622.]
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•  Establishing  a  profile 

- Using NFDUMP to receive, store, and process netflow data

- rrdtool with the aberrant behavior module

- rrdtool (http://oss.oetiker.ch/rrdtool/)

- aberrant behavior module

• Learns from past values and uses them to predict the
future
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Dynamic profile


yiming> more IR-syn-Amsterdam
13 1864
9 144
21 85

rrdtool create IR-syn-Amsterdam.rrd -s 300 \
DS:13:GAUGE:1200:0:U \
DS:9:GAUGE:1200:0:U \
DS:21:GAUGE:1200:0:U \
RRA:HWPREDICT:2016:0.001:0.0035:288

rrdtool tune IR-syn-Amsterdam.rrd --deltapos 8
# deltapos sets the scale parameter for the upper tolerance band
# different elements should use different values
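
The idea of an upper tolerance band scaled by deltapos can be illustrated with a much simpler model than the one rrdtool uses. The sketch below uses plain exponential smoothing of the value and of its absolute deviation, with no trend or seasonal terms, so it is not the Holt-Winters algorithm of the aberrant behavior module. All parameter names and defaults are assumptions.

```python
def find_failures(series, alpha=0.1, gamma=0.1, delta_pos=8.0, warmup=12):
    """Return indexes where a value exceeds forecast + delta_pos * deviation.

    Simplified model: exponential smoothing only, no trend or seasonality.
    """
    forecast = series[0]
    deviation = 0.0
    failures = []
    for i, value in enumerate(series[1:], start=1):
        # Compare against the band before learning from the new value.
        if i >= warmup and value > forecast + delta_pos * deviation:
            failures.append(i)
        deviation = gamma * abs(value - forecast) + (1 - gamma) * deviation
        forecast = alpha * value + (1 - alpha) * forecast
    return failures


# Hypothetical 5-minute SYN counts with one spike.
counts = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 100, 101, 5000, 100]
spikes = find_failures(counts)  # [14]
```

A larger delta_pos widens the band and suppresses alerts on noisy elements, which is why different elements should use different values.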
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•  Only  an  entry 

• IR-syn-Amsterdam: [1196800800]RRA[FAILURES][1]DS[13] = 1.0000000000e+00

• Need a script to do the trace-back work

- Every 10 minutes, it scans the rrd output for failures

- Now rrdtool generates a failure alert, so what?


www.xo.com 


CONFIDENTIAL©  2007  XO.  ALL  RIGHTS  RESERVED.  XO,  THE  XO  DESIGN  LOGO,  XOPTIONS  AND  ALL  RELATED 
MARKS  ARE  TRADEMARKS  OF  XO.  ALL  OTHER  TRADEMARKS  ARE  PROPERTY  OF  THEIR  RESPECTIVE  OWNERS. 


•  Tracking  down 

-  Past_5|10  minutes_flow  of 'TCP  +  SYN  bit  only  +  IFindex 
13  +  router  Amsterdam' 

-  "and" 

-  Who  is|are  behind  the  spike? 

•  Spike  should  be  caused  by  one  or  several  hosts 

•  these  hosts  can  be  either  victims,  attackers  or 
normal  hosts 

-  Scan  ->  attacker 

- DoS|DDoS -> victim or attacker

- Email server and others

• They have too many sessions or too much traffic

-  How  many  is  too  many? 
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Finding active host


•  For  different  protocol,  different  network,  the  definition  of 
"too  many"  is  different 


•  Alert!  ICMP  ping  Used  to  be  1,  now  is  10! 
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Finding  active  host 


•  After  rrdtool  generates  a  failure 

session-icmp*)

alert-trigger-number="500"; # condition a
flowfilter="proto icmp and port 2048 and if $if";
session-generated-by-single-host="280"; # condition b

;;

session-syn*)

alert-trigger-number="2000";
flowfilter="proto tcp and flags 2 and if $if";
session-generated-by-single-host="600";

• A failure that matches all the conditions can be regarded as
a real failure, and further action will be needed
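
The two conditions in the configuration above can be sketched as a single check. The interpretation is an assumption: condition a is read as a minimum number of matching sessions in the inspected window, and condition b as a minimum number of sessions generated by one host. The function name and the input format are hypothetical.

```python
from collections import Counter


def confirm_failure(flow_hosts, alert_trigger_number, per_host_limit):
    """Return the hosts to report if both conditions hold, else an empty list.

    flow_hosts: one host IP per matching flow record in the inspected window.
    """
    if len(flow_hosts) < alert_trigger_number:           # condition a
        return []
    counts = Counter(flow_hosts)
    return [host for host, n in counts.most_common()
            if n >= per_host_limit]                      # condition b


# Hypothetical window: 550 ICMP sessions, 300 of them from one host.
window = ["192.0.2.1"] * 300 + ["192.0.2.2"] * 250
suspects = confirm_failure(window, alert_trigger_number=500, per_host_limit=280)
```

With the ICMP thresholds from the configuration (500 and 280), only the host with 300 sessions is reported.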
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Netflow records


•  Pull  out  necessary  data 

•  Generate  alert 

-  Picture,  email 
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Alert


• Scan alert

>IR LosAngeles has 5462 sessions on proto tcp and flags 2 and if 50 in 5 minutes
50 = STRING: [redacted]
50 = STRING: [redacted]

>snapshot picture
http://[redacted]LosAngeles-50-abnormal.png

>one week|month picture
http://[redacted]LosAngeles-50-abnormal-week.png
http://[redacted]LosAngeles-50-abnormal-month.png

>Top IPs in 10 minutes

Date first seen          Duration  Proto  IP Addr             Flows  Packets  Bytes  pps  bps   bpp
2007-12-05 08:51:02.520  289.718   any    218.233.[redacted]  2114   2114     84560  7    2334  40
2007-12-05 08:51:20.493  130.413   any    218.234.[redacted]  605    605      24200  4    1484  40

>Top IP info

[Table with columns AS, IP, AS name, FQDN for 218.233.[redacted] and 218.234.[redacted]; both AS names end in "Telecom", other fields are redacted.]

Alert


• Day

[Figure: IR LosAngeles ifindex one-day sessions, Tue 12:00 to Wed 08:00, showing failures, expected value, and real sessions on ifindex 50. Current: 5279; average: 2402; max: 14633.]


Alert


• Week

[Figure: IR LosAngeles ifindex one-week sessions, showing failures, expected value, and real sessions on ifindex 50. Current: 5683; average: 809; max: 7350.]
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Alert


• Scan alert

>Top ip detail
/ ip 218.233.[redacted]

**Traceroute (from hop 5 to 9)
[Hop addresses redacted; legible fragments include ntt.net host names and round-trip times of 7.089 ms, 7.317 ms, 73.034 ms, and 73.922 ms.]

**Protocol summary for 218.233.[redacted]

Proto  Flows  Packets  Bytes  pps  bps   bpp
6      2116   2116     84640  7    2337  40
17     1      1        257    0    0     257

**Sampled netflow records
[Ten TCP records, each from 218.233.[redacted]:6000 to 65.99.[redacted]:7212, with what appears to be a flags column; the remaining columns read 40, 0, 50, and 73 in every record.]
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Alert


• Scan alert

/ ip 218.234.[redacted]

**Traceroute (from hop 5 to 9)
[Hop addresses redacted; legible round-trip times: 7.135 ms, 7.268 ms, 79.209 ms, and 74.073 ms.]

**Protocol summary for 218.234.[redacted]

Proto  Flows  Packets  Bytes  pps  bps   bpp
6      605    605      24200  4    1484  40

**Sampled netflow records
[Ten TCP records, each from 218.234.[redacted]:6000 to 71.60.[redacted]:6588, with what appears to be a flags column; the remaining columns read 40, 0, 50, and 73 in every record.]
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Alert


• Storm worm

[Figure: IR Chicago icmp type|code sessions, showing failures, expected value, and real sessions on icmp type|code 2048. Current: 3333; average: 3301; max: 7413.]
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Alert - one week later


• DDoS

>IR LosAngeles has 177002 sessions on proto tcp and flags 2 and if 50 in 5 minutes
50 = STRING: [redacted]
50 = STRING: [redacted]

>snapshot picture
http://[redacted]LosAngeles-50-abnormal.png

>one week|month picture
http://[redacted]LosAngeles-50-abnormal-week.png
http://[redacted]LosAngeles-50-abnormal-month.png

>Top IPs in 10 minutes

Date first seen          Duration  Proto  IP Addr             Flows   Packets  Bytes   pps  bps     bpp
2007-12-12 09:00:23.705  320.317   any    89.144.[redacted]   173554  176183   10.0 M  550  261507  59
2007-12-12 09:00:28.361  297.573   any    211.211.[redacted]  1048    1056     50688   3    1362    48
2007-12-12 09:00:43.293  282.273   any    211.206.[redacted]  692     708      33984   2    963     48
2007-12-12 09:00:43.269  291.093   any    211.44.[redacted]   658     667      42688   2    1173    64
2007-12-12 09:00:43.401  289.437   any    218.48.[redacted]   633     684      32832   2    907     48
2007-12-12 09:00:37.353  288.445   any    123.214.[redacted]  627     640      30720   2    852     48
2007-12-12 09:00:23.705  311.869   any    58.127.[redacted]   603     618      39552   1    1014    64

>Top IP info

[Table with columns AS, IP, AS name, FQDN for the seven addresses above; six AS names end in "Telecom Inc.", and one lookup shows "connection timed". Other fields are redacted.]
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Alert  -  one  week  later 


• Traceroute returns nothing

>Top ip detail
/ ip 89.144.[redacted]

**Traceroute (from hop 5 to 9)
no traceroute info here

**Protocol summary for 89.144.[redacted]

Proto  Flows   Packets  Bytes   bpp
6      173974  176610   10.0 M  59
(The pps and bps values are illegible in the source.)

**Sampled netflow records
[Ten TCP records from ten different source addresses (in 219.x, 123.x, 58.x, 211.x, 221.x, and 218.x) with varying source ports, all to 89.144.[redacted]:80; the next two columns read 48 or 64, and 0.]
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Alert - one week later


/ ip 211.211.[redacted]

**Traceroute (from hop 5 to 9)
[Hop details illegible.]

**Protocol summary for 211.211.[redacted]

Proto  Flows  Packets  Bytes  pps  bps   bpp
6      1057   1065     51120  1    1174  48

**Sampled netflow records
[Ten TCP records from 211.211.[redacted] with varying source ports, all to 89.144.[redacted]:80; the next two columns read 48 and 0 in every record.]
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Alert  -  one  week  later 


[Figure: IR LosAngeles ifindex one-day sessions, Tue 12:00 to Wed 08:00, y-axis 20 k to 200 k, showing failures, expected value, and real sessions on ifindex 50. Current: 16130? (last digit illegible); average: 9734; max: 161303.]
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Alert  -  whitelist 


• Special customers

>Top IP info

[Table with columns AS, IP, AS name, FQDN: 64.39.[redacted] has a host name beginning with "scanner", and 63.245.[redacted] has one beginning with "core27"; other fields are redacted.]

>Top IP detail
/ ip 64.39.[redacted]

**Traceroute (from hop 5 to 9)
[Hop details illegible.]

**Protocol summary for 64.39.[redacted]

Proto  Flows  Packets  Bytes  pps  bps  bpp
6      182    200      8584   0    231  42
17     1      1        58     0    0    58

**Sampled netflow records
[Ten TCP records from 64.39.[redacted] with varying source ports to 63.245.[redacted] on varying destination ports.]
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


• Whitelist (cont.)

- Email servers

- We don't want to miss a real attack even if an IP is on the whitelist

• Alert email

- Suppression period

- Subject

• 12-05 abnormal sessions at LosAngeles proto tcp and flags 2 and if 50


www.xo.com 


CONFIDENTIAL©  2007  XO.  ALL  RIGHTS  RESERVED.  XO,  THE  XO  DESIGN  LOGO,  XOPTIONS  AND  ALL  RELATED 
MARKS  ARE  TRADEMARKS  OF  XO.  ALL  OTHER  TRADEMARKS  ARE  PROPERTY  OF  THEIR  RESPECTIVE  OWNERS. 


•  Database 

-  3  tables 

•  IP,FQDN,AS 

•  Summary 

•  Raw  netflow  data 

-  Data  mining 

• Which peering neighbor sends the most attack traffic, who is attacked most, which port is scanned most often, etc.
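
A minimal sketch of the kind of data-mining query listed above. The talk names MySQL; here Python's built-in sqlite3 stands in so the example is self-contained, and the table name, columns, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of flows that triggered alerts; not the schema from the talk.
conn.execute("""CREATE TABLE alert_flows (
    peer TEXT, src_ip TEXT, dst_ip TEXT, dst_port INTEGER, packets INTEGER)""")
conn.executemany("INSERT INTO alert_flows VALUES (?, ?, ?, ?, ?)", [
    ("peer-A", "198.51.100.7", "192.0.2.10", 80, 1200),
    ("peer-A", "198.51.100.8", "192.0.2.10", 80, 900),
    ("peer-B", "203.0.113.5", "192.0.2.20", 22, 300),
])

ALLOWED_COLUMNS = {"peer", "dst_ip", "dst_port"}


def top(column, limit=1):
    """Rank values of `column` by total packets seen in alert traffic."""
    if column not in ALLOWED_COLUMNS:  # column names cannot be bound as parameters
        raise ValueError(f"unsupported column: {column}")
    query = (f"SELECT {column}, SUM(packets) AS total FROM alert_flows "
             f"GROUP BY {column} ORDER BY total DESC LIMIT ?")
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    print("top peering neighbor:", top("peer"))
    print("most attacked host:  ", top("dst_ip"))
    print("most scanned port:   ", top("dst_port"))
```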
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Data  mining 


•  Database 

-  3rd  party  outside  data 

•  Dshield  TOP  10000 

•  Dshield  AS 

•  CBL  data 

•  Mynetwatchman 

•  Our  own  darknet  project  output 

•  Other  private  outside  data 

- If an XO host is involved, these tables are checked
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problem 


•  Problem 

-  Peering  neighbor 

-  Alert  correlation 

• But you can do it in the database.
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•  Nfdump  (or  any  other  free  flow  software),  rrdtool, 
mysql,  net-snmp,  dig,  apache,  some  unix  commands 

•  A  box 
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•  For  more  info 

-  vimina.aona@xo.com 

-  http://security.zz.ha.cn 


•  Thanks! 
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Visual  Representations 

of  Flow  Data 

and  the  Value  of  Visual  Language 


Human-Machine Efficiency

Over-Learned: Feedback

non-volitional feedback

haptic

> correct errors in production

volitional feedback

visual / aural


Human-Machine Efficiency

Over-Learned: Feedback - haptic vs visual/aural

Haptic Feedback

Sequential access

Sidewinder™ Force Feedback

Visual / Aural Feedback

Falcon™

Random access

data/gesture glove

voice control

multi-touch
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Human-Machine  Efficiency 

Under-Learned:  Representation 


arbitrary

association

FTP Server

PCAP

TXH 1138

metaphor

representational

indexical
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Flow  in  hyperbolic  space 


3  month  SSC  project  in  2002 
discover  and  apply  network  visualization  tools 
Hyperviewer:  quasi-hierarchical  hyperbolic  space 
‘fish-eye’  3-d 

Created  by  Stanford  researcher  Tamara  Munzner 




Flow  in  hyperbolic  space 


Easily adapted to a forced-hierarchy view of flow
Open-source C++ library and UI
Experimented with visual methods

- colors
- graph cycles
- scaling
- text labels
- graph size
- search automation


Symmetry  in  port  access  from  3  separate  clients. 
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src/dst  ports  colored  red/blue 
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Hierarchy  showing  client  subnet  and  server  ports 
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Shapes  Vector 


Acquired  by  DARPA  in  2002 
Developed by Australian DSTO (Defence Science and Technology Organisation)
JTF-GNO  pilot  program  from  2003-2006 


What  is  it? 

Intelligent Agents (IAs) gather information and produce inferences
Gathers information from multiple sources: pcap, flow, Snort, syslog, etc.

IAs perform automated data correlation & knowledge extraction
Integrates visual and command-line analysis
Integrated visualization makes use of human vision
Supports visual analysis and decision-making


©  2007  Sunny  Fugate 


Shapes  Vector 


Contextual: spatial, temporal, social, topological
Spatial: physical geography or metaphor
Temporal: sequences in time, correlated
Visual: use visual language to depict objects & events
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Architecture 


Agents  can  be  written  in  many  languages  - 

must  conform  to  the  SV  ontology  and 
knowledge  architecture  (SVKA)  specification 

Sensors  can  be  built  to  wrap  many 
information  sources  -  must  produce  SV 
ontology 

SV  ontology  is  a  knowledge  description 
language  for  network  defense 
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Shapes  Vector -Visual  Language 

Easily  defined  visual  mappings 
•  No  applied  theory  of  visual  language 


shape/color/scale 
texture/icon 
connection  /  topology 

movement 


packet  events,  information  exchange,  attribute  changes,  attribute 
values,  host  id,  software,  processes,  machine  purpose,  network 
topology,  social  topology,  intrusion  events,  event  type,  event  priority, 
client  vs  server,  routing, ... 
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Topological layout using visual demarcations
(e.g. firewall, network segment, physical layout)
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Automated  layout  to  arrange  hundreds  of  sub-graphs  in  a 
non-overlapping  manner. 
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Color,  shape,  texture,  icon,  location,  arrangement 
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Visual  grouping,  demarcation,  and  detail-hiding 
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Shapes  Vector  Flow  Viewer 


JTF-GNO funded effort to implement SV

- Use SV architecture and components
- DARPA demo system > operational system
- New scripts, sensors, agents, and GUI

Results

- A visual augmentation of CLI
- Produces a view of social topology
- Intuitive view of gobs of data
- Static topology and event replay
- Links statistical views and topology view
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Flow Viewer: GUI

- multiple stats views linked to visuals
- playback specific ranges & loop
- adjust replay velocity
- time-skip
- IP and attribute hotlists
- dynamic filtering controls
- GUI managed rwfilter
- filter using SV ontology
- integration between flow, AMP, IDS, & PCAP
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Flow Viewer
Sensors

[Diagram labels: SVKA, Level N, Generic Sensor; data sources AMP, Flow, PCAP]

Flow Agent: consumes rwf & rwcut data
AMP Agent: queries database for most recent attributes
PCAP Agents: query & reconstruct TCP sessions
IDS Agents: process IDS logs
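
As an illustration of what consuming rwcut data can look like, the sketch below parses pipe-delimited text of the kind SiLK's rwcut prints by default (a header row, space-padded columns, a trailing `|`). The sample rows and the `parse_rwcut` helper are made up for this example and are not taken from the Flow Viewer code.

```python
SAMPLE = """\
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|
      192.0.2.1|   198.51.100.9|40210|   80|  6|         3|       144|
   203.0.113.77|   198.51.100.9|51515|   25|  6|        12|      4096|
"""


def parse_rwcut(text):
    """Turn pipe-delimited rwcut-style text into a list of dicts keyed by column name."""
    header = None
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("|")]
        if fields[-1] == "":
            fields.pop()  # drop the empty field after the trailing '|'
        if header is None:
            header = fields
        else:
            records.append(dict(zip(header, fields)))
    return records


if __name__ == "__main__":
    for record in parse_rwcut(SAMPLE):
        print(record["sIP"], "->", record["dIP"], "port", record["dPort"])
```

All values stay strings here; a real agent would convert ports, counters, and timestamps before turning records into ontology facts.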


Flow  Viewer 
Intelligent  Agents 


Flow Sensor

- converts flow into ontology
- produces facts

Flow Agent

- correlates records
- counts and corroborates
- produces inferences
- produces visual events

AMP Agent

- uses correlations from Flow Agent
- query made on every unique IP seen
- produces visual events
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Flow  Viewer 
Visual  Language 


Leverage  cultural  knowledge 



Use  metaphors  for  abstract 
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www.navy.mil  DST  Port  3847 1 

SRC  Port  80 
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Scaling  host  or  packet  based  on  total  packet/bytes 


Color  by  ownership 

USA  AF  USN 
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Internet 
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Flow  Viewer 
Visual  Language 
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Test  installation 
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Flow  Viewer 


Visualization 

• Tested using:

• 100-5000 nodes

• 1M-3M flows

• 10K-300K flows per hour

Integrated filtering (rwfilter, SVKA filtering, visual filter)

- Visual ID
- Queries
- Grouping (e.g. domain, netblock, vulnerability)
- Replay-mode or real-time
- Historic visual context
- Replay 'on top of' known incident
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Flow  Viewer 
data  prep 

Include

• Incoming & outgoing
• Hub & core-to-core traffic
• Wide port ranges
• Time-span wider than the activity (minutes to hours)
• Suspect IPs and ranges

Filter

- Superfluous port traffic (e.g. 80, 53, 25)
- IPs that are unrelated to the incident

Sampling & Time

- Dense data
- Smear data across time resolution (~1 second)
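
The slide does not define "smear". One plausible reading, sketched here as an assumption, is to spread each flow's packet count evenly over the one-second bins between its start and end time so that replay shows steady activity instead of a single burst. The function name and signature are illustrative only.

```python
import math


def smear(start, end, packets, resolution=1.0):
    """Spread `packets` evenly over the resolution-sized time bins covering [start, end].

    Returns a list of (bin_start_time, packets_in_bin) tuples.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    first_bin = math.floor(start / resolution)
    last_bin = math.floor(end / resolution)
    bin_count = last_bin - first_bin + 1
    share = packets / bin_count
    return [((first_bin + i) * resolution, share) for i in range(bin_count)]


if __name__ == "__main__":
    # A flow of 9 packets lasting from t=10.2 s to t=12.7 s touches three 1-second bins.
    for bin_start, count in smear(10.2, 12.7, 9):
        print(bin_start, count)
```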
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Flow  Viewer  Performance 


[Figure: minimum frames per second (y-axis, 12 to 60) versus # of visible objects (x-axis). Bands: Exceptional 30-60 fps, Good 20-30 fps, Acceptable 10-20 fps, Unacceptable < 10 fps.]

Graphics performance on dual 1.5 GHz SPARC SunFire v440 with Sun XVR 1200
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Flow  Viewer  Performance 


Real-time performance   Real-time records / hour   Optimal playback rate
Optimal                 10K-30K/hour               10X real-time
Acceptable              40K-100K/hour              Real-time
Poor                    100K-300K/hour             1/10X real-time


Sparse  data  sets  can  be  viewed  quickly 
e.g.  months  of  data  in  minutes 


Dense  data  sets  can  be  viewed  slowly  or  filtered 
e.g.  seconds  of  data  in  minutes 
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Knowledge  Depth  vs  Breadth 


What  trade-offs  are  we  making? 

• UI feedback?
- Haptic vs visual feedback

• Data access?
- Random vs sequential access

• Training?
- Under-learned vs over-learned
- Tool complexity

• Meaning?
- Visual semantic vs text
- Intuitive/iconic vs cryptic/coded
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SPAWAR 

Systems  Center 
San  Diego 


Next  Generation  Tactical  Situation 
Assessment  Technology 
(NG-TSAT) 


Objective: Next-generation Tactical Chat. Icon-based situation
assessment (SA) language supported by wireless gesture-recognition
gloves used in hostile or noisy (silence-mandated) environments

Description  of  Effort: 

1.  Linguistic  Analysis:  Analysis  of  current  C2  chat  logs  to 
determine  speech  patterns  and  repetitive  SA  concepts/themes 

2.  Iconic  Language  Development:  Output  of  linguistic  analysis 
determines  candidate  icons  representing  most  prevalent  SA 
“themes;”  development  of  prototype  C2  iconic  SA  language 

3.  Wireless,  Gesture-Recognition  Gloves:  Develop  wireless 
gloves  that  recognize  C2  icons/gestures  which  can  transmit 
across  network  to  distributed  warfighters  (replacing  keyboard 
input  when  in  MOPP) 

Benefits  of  TSAT 

Compressed Chat (25% less content; 50% reduction in production time) for rapid SA dissemination.
Gesture-recognition in very noisy, distributed ops, or in very austere environments (e.g., the moon)

Challenges: 

1.  No  current  method  or  theory  for  chat-meaning  compression;  currently  done  in  prose;  computer 
linguistic  analysis  of  unstructured  text  still  neoteric. 

2. Wireless gesture recognition glove technology still in infant stages of development; focused on
commercial animation support, not on disciplined language support

TRL:  Chat:  TRL 1-2;  Gesture-recognition:  TRL 1-4 


Major  Milestones  FY06: 

Linguistic analysis discovery of common C2 SA themes
Development  of  icon/symbols  for  candidate  SA  themes 
Development  of  proof-of-concept  wireless  gesture-recognition  glove 

Period  of  Performance:  2007-2012 

PI  contact  Info:  Dr.  LorRaine  Duffy,  (619)  553-9222, 
LorRaine.Duffy@navy.mil,  SSC  San  Diego,  CA 
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Synaesthesia 


Synaesthesia:  "a  neurological  condition  in  which  two  or  more  senses 
are  coupled." 

"loud  color"  "sharp  laugh"  "bitter  wind" 


grapheme-color synesthesia - letters or numbers are perceived as inherently colored


How  many  numbers  contain  the  digit  6? 


9910  9972  3292  7602  82  9054 
5636  2710  1944  6330  6560  8101 
5177  1955  7029  4083  4643  5710 
4935  2256  1495  1025  8375  8518 
80  797  2610  3008  8784  1854  2383 
9728  4523  573  5914  7975  281 
6664  2682  7689  7753  273  5597 
799  9960  1437  4534  8601  4563 
6734  647  9409  6543  4827  2398 
1532 
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Is  this  easier? 


[The same grid of numbers, with some digits rendered in color; the colored digits were lost in extraction]
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Emulating  Synaesthesia 


These methods can be used to achieve sequence disambiguation and ...

[Figure: the number grid shown twice, once plain and once with selected numbers highlighted; the highlighted entries were lost in extraction]


Emulating  Synaesthesia 


129.168.1.233

[The address is repeated with individual digits color-coded; the colored digits were lost in extraction]
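
A small sketch of the idea: give every digit a fixed color so that repeated digits in an IP address stand out at a glance. The palette and the `colorize` helper are arbitrary choices for illustration, the escape sequences assume a terminal with ANSI 256-color support, and the addresses are documentation examples rather than the ones on the slide.

```python
# One ANSI 256-color code per digit; the palette is an arbitrary assumption.
DIGIT_COLORS = {
    "0": 244, "1": 196, "2": 46, "3": 21, "4": 226,
    "5": 201, "6": 51, "7": 208, "8": 93, "9": 130,
}
RESET = "\x1b[0m"


def colorize(text):
    """Wrap each digit in its own color escape; leave other characters untouched."""
    parts = []
    for char in text:
        code = DIGIT_COLORS.get(char)
        if code is None:
            parts.append(char)
        else:
            parts.append(f"\x1b[38;5;{code}m{char}{RESET}")
    return "".join(parts)


if __name__ == "__main__":
    for address in ("192.0.2.1", "192.0.2.11", "198.51.100.233"):
        print(colorize(address))
```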


Language  Domains 


American  English 
Grammar/Structure 


Standard 
American  English 


Mathematics 


Medicine 


Cultures  and  knowledge  domains  don’t  necessarily 
use  the  same  lexicon  or  even  the  same  grammar! 


Non-standard 
American  English 


American  English 
Concept  Map 


How  does  the  CND  lexicon  map  to  common  language? 
Technical  language?  Military/tactical  language? 
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Automating  the 
configuration  of  flow 
monitoring  probes 


Xenofontas  (Fontas)  Dimitropoulos  (xed@zurich.ibm.com) 
Andreas  Kind  (ank@zurich.ibm.com) 


IBM  |  Dec  07 


Systems  Department 


www.zurich.ibm.com 


Zurich  Research  Laboratory 


Outline 

■  Background  and  motivation. 

■  Probe  configuration  architecture: 

Requirements  and  goals. 
Design. 

Implementation. 

■  Future  work  and  conclusions. 
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Network  configuration 


■  Network  elements  are  typically  configured  with  low-level  commands,  e.g., 
Cisco  IOS  commands. 


■  Network  administrators 
manage  numerous  network 
elements  with  lengthy 
configuration  files. 

■  Network  configuration  is 
an  error-prone  and 
time-consuming 
process. 

■  Configuration  errors  can 
be  costly,  e.g.: 

network  outages 
violations  of  SLAs 


[Figure: configuration file length distribution in an enterprise network; x-axis: router ID (sorted by file size). Source of figure: 100x100 project]


X Dimitropoulos | Systems Department | IBM Research


Probe  configuration 


■  The  configuration  of  monitoring  probes  is  part  of  the  more  general  network 
configuration  problem. 

■  Monitoring  probes  are  gradually  becoming  more  intelligent,  for  example, 
using  advanced  sampling  and  data  aggregation  techniques.  Consequently, 
their  configuration  becomes  more  involved. 

■  Flexible  Netflow  (FNF)  and  IPFIX  provide  numerous  configuration  options 
that  were  not  available  earlier: 

FNF  has  58  different  configuration  commands. 

FNF  provides  65  different  fields,  arbitrary  combinations  of  which  can  be  used  in 
the  definition  of  flow  key  and  non-key  fields. 

■  Certain  network  operation  applications  need  to  dynamically  change 
configuration  to: 

adapt  to  changing  traffic  conditions. 

investigate  on-going  network  anomalies. 
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Configuration  requirements 


[Figure: network operation applications (traffic profiling, billing, anomaly detection, application identification, traffic engineering) pass their application needs to probe configuration, which produces low-level configuration for the network monitoring probes]

■ Probe configuration should:

1. take into account application needs.
2. be aware of the available monitoring probes.
3. generate low-level configuration commands.
4. configure or update the configuration of probes.
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Probe  configuration  architecture 


Three modules:

- the measurements module describes different measurements, i.e., application needs.
- the inventory module describes the monitoring probes of a network.
- the back-end module provides necessary information for generating low-level commands.

The specification identifies application needs.

The configurator:

- uses the modules and specification to generate low-level commands.
- configures the probes.
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Design  goals  for  simplifying  configuration 

1.  Abstraction:  hide  low-level  configuration  commands. 

2.  Objective-oriented  configuration  expression: 

express  configuration  in  terms  of  measurement  objectives, 
focus  on  measurements  instead  of  devices. 

3.  Network-wide  configuration:  configure  a  network  instead  of 
configuring  individual  devices. 

4.  Re-usability:  make  parts  of  configuration  network-independent. 

5.  Extensibility:  easily  introduce  support  for  new  commands, 
measurements,  etc. 
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Configuration  abstraction  hierarchy 


■  1st  level:  vendor-specific  configuration 
commands. 

■  2nd  level:  probe  elements  (pe),  i.e., 
logical  components  of  a  probe,  like 
interface,  flow  cache,  exporter. 

■  3rd  level:  configlet,  i.e.,  a  set  of  specific 
probe  elements  that  realizes  a 
measurement. 

■  4th  level:  measurement  services,  i.e.,  a 
configlet  with  certain  probe  selection 
rules. 
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Back-end  module 


■  Specifies  different  probe 
elements. 

■  A  probe  element  specification: 

is  written  in  XML. 

has  a  unique  id. 

identifies  parameters 
and  parameter  default 
values. 

determines  the  low-level 
vendor-specific  commands. 


<!-- Probe Element Exporter -->
<pe id='generic_exporter'>
  <params>
    <param id='port'>90</param>
    <param id='transport'>udp</param>
    <param id='destination'>192.0.0.1</param>
    <param id='label'>EXPORTER</param>
  </params>
  <template>
    <ios>
      flow exporter $label
      destination $destination
      transport $transport $port
    </ios>
    <yaf>
      --out $destination --ipfix $transport --ipfix-port $port
    </yaf>
    <junos>
    </junos>
  </template>
</pe>
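
How a probe element specification could be turned into low-level commands can be sketched with Python's standard library. This is not the authors' configurator: `render_probe_element` and its `overrides` argument are hypothetical names, and `string.Template` is used only because its `$name` placeholders happen to match the syntax on the slide.

```python
import textwrap
import xml.etree.ElementTree as ET
from string import Template

PE_XML = """
<pe id='generic_exporter'>
  <params>
    <param id='port'>90</param>
    <param id='transport'>udp</param>
    <param id='destination'>192.0.0.1</param>
    <param id='label'>EXPORTER</param>
  </params>
  <template>
    <ios>
      flow exporter $label
      destination $destination
      transport $transport $port
    </ios>
    <yaf>
      --out $destination --ipfix $transport --ipfix-port $port
    </yaf>
  </template>
</pe>
"""


def render_probe_element(pe_xml, target_os, overrides=None):
    """Fill the template for `target_os` with default parameters plus overrides."""
    pe = ET.fromstring(pe_xml.strip())
    # Defaults come from the specification; a measurement may override them.
    params = {p.get("id"): (p.text or "") for p in pe.find("params")}
    params.update(overrides or {})
    template = pe.find("template").find(target_os)
    if template is None:
        raise KeyError(f"no template for {target_os!r}")
    body = textwrap.dedent(template.text or "").strip()
    return Template(body).substitute(params)


if __name__ == "__main__":
    print(render_probe_element(PE_XML, "ios"))
    print(render_probe_element(PE_XML, "yaf", {"port": "4739"}))
```

Keeping the vendor-specific text inside the specification means that supporting another platform only requires another template element, which fits the extensibility goal stated in the talk.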
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Inventory  module 


■  Specifies  network  probes,  i.e.,  lists 
the  characteristics  that  can  be 
useful  for  their  configuration. 


■  Besides  describing  location, 
system,  and  interface  information, 
it  declares  tags  that  can  be  used 
for  grouping  probes  and  for  probe 
selection. 


<probe id='trabant.zurich.ibm.com'>
  <address>9.4.68.154</address>
  <location>
    <city>Zurich</city>
    <state>Central CH</state>
    <country>Switzerland</country>
  </location>
  <system>
    <os>ios</os>
    <version>12.4</version>
  </system>
  <interface id='FastEthernet0/0'>
    <capacity>100Mbits</capacity>
    <tag>internal</tag>
  </interface>
  <interface id='FastEthernet0/1'>
    <capacity>100Mbits</capacity>
    <tag>customer</tag>
  </interface>
  <tags>
    <tag>edge</tag>
  </tags>
</probe>
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Measurements 

module 


<!-- Monitor how much traffic is sent -->
<!-- between IP blocks. -->
<msr id='trafficmatrix'>
  <params> <!-- Default pa... -->
    <param id='collector_a...
    ...
  </params>
  <!-- Probe element chain -->
  <configlet>
    ...
  </configlet>
  <rules>
    ...
  </rules>
</msr>

Detail of the configlet (parts hidden on the slide are shown as "..."):

<!-- Probe element chain -->
<configlet>
  <pe>
    <name>exporter</name>
    <params>
      <param ...>EXPORTER</param>
      <param ...>$collector_address</param>
      <param ...>$collector_port</param>
      <param ...>$collector_transport</param>
    </params>
  </pe>
  <pe>
    ...
      <param ...>..._DST_PREFIX_REC</param>
    ...
  </pe>
  <pe>
    <name>interface</name>
    <params>
      <param id='monitor'>TM_CACHE</param>
      <param id='interface'>$interface->id</param>
      <param id='direction'>output</param>
    </params>
  </pe>
</configlet>

Detail of the rules:

<rules>
  <interface>
    if ($interface.tag eq "external" and
        $probe.tag eq "edge") {
      return 1;
    } else {
      return 0;
    }
  </interface>
</rules>
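
The probe selection rule can be mimicked in a few lines. The sketch below applies the same condition (interface tagged "external" on a probe tagged "edge") to a made-up inventory; the dictionary layout, probe names, and function names are assumptions for illustration and do not reflect the tool's internal data model.

```python
# Hypothetical inventory, loosely modelled on the inventory module's probe entries.
INVENTORY = [
    {"id": "probe-a.example.net", "tags": {"edge"},
     "interfaces": [{"id": "FastEthernet0/0", "tags": {"internal"}},
                    {"id": "FastEthernet0/1", "tags": {"external"}}]},
    {"id": "probe-b.example.net", "tags": {"core"},
     "interfaces": [{"id": "FastEthernet0/0", "tags": {"external"}}]},
]


def interface_rule(probe, interface):
    """Same condition as the rule on the slide: external interface on an edge probe."""
    return "external" in interface["tags"] and "edge" in probe["tags"]


def select_interfaces(inventory, rule):
    """Return (probe id, interface id) pairs on which the measurement should be enabled."""
    return [(probe["id"], interface["id"])
            for probe in inventory
            for interface in probe["interfaces"]
            if rule(probe, interface)]


if __name__ == "__main__":
    print(select_interfaces(INVENTORY, interface_rule))
```

Because the rule refers only to tags, the same measurement definition can be reused on a different network by supplying a different inventory.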




Input  specification 


■  Lists  the  measurements  and  the 
probes  in  which  to  enable  these 
measurements. 


■  Is  the  user  interface  and  can  be 
generated  through  a  GUI. 


<!-- Probes to apply measurements on -->
<probe id='wassen.zurich.ibm.com'></probe>
<probe id='trabant.zurich.ibm.com'></probe>

<!-- Measurements -->
<msr id='trafficmatrix'>
  <params> <!-- overwrite default values -->
    <param id='collector_address'>9.4.68.204</param>
    <param id='collector_port'>2055</param>
    <param id='collector_transport'>udp</param>
  </params>
</msr>

<msr id='app_monitoring'>
  <params> <!-- overwrite default values -->
    <param id='collector_address'>9.4.68.205</param>
    <param id='collector_port'>2055</param>
    <param id='collector_transport'>udp</param>
  </params>
</msr>
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Conclusions 

■  Described  an  architecture  for  simplifying  the  configuration 
of  flow  monitoring  probes: 

abstract  configuration  of  probes  and  hide  low-level  details. 

focus  on  measurement  services  that  satisfy  the 
objectives  of  applications. 

generate  and  set  configuration  automatically. 

■  Future  work: 

Incorporate  error-checking  techniques. 

Develop  libraries  for  typical  measurements. 

-  Use  NetConf. 

Configuration  optimization. 
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Outline 


Review  of  law  principles  and  requirements  on 
data  protection 

-  European  viewpoint 

-  What  is  personal  data? 

-  Why  is  data  protection  law  relevant  for  network 
monitoring? 

-  Law  principles  overview 

The  role  of  flow  data  anonymisation  to  support 
data  protection 

-  Discussion  on  its  applicability  and  weaknesses 

-  Suggestions  for  future  steps 
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Data  Protection  Law:  EU  Directives 


■  Goal:  protect  the  privacy  of  individuals 

-  Not  limited  to  information  confidentiality 

■ EU Directives define the minimum legal requirements to be implemented by each EU member state

- Applicable to international data transfers with EU

■ Relevant to data protection:

- Directive 95/46/EC - on data protection

- Directive 2002/58/EC - on privacy and electronic communications
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Applicability  and  Personal  Data 


■ Directive 95/46/EC applies to the "processing" of "personal data"

Personal data: "any information relating to an identified or identifiable
natural person ('data subject'); an identifiable person is
one who can be identified, directly or indirectly, in
particular by reference to an identification number or to
one or more factors specific to his ... identity".

Processing: "any operation performed upon personal data, such as e.g.
collection, storage, adaptation or alteration, consultation,
disclosure by transmission, dissemination or otherwise making
available, alignment or combination, erasure or destruction"


■ Note: in some countries (e.g. Switzerland) this applies to
"legal entities" as well
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Applicability  to  Network  Monitoring 

■  Indirect  identification  data  comprise  any 
information  that  may  lead  to  identification  of  the 
data  subject  through  association  with  other 
available  information 

-  information  available  to  the  entity  in  charge  of  the 
data  processing  (ISP), 

-  any  information  possessed  by  third  parties 

■ IP addresses can identify someone "directly"

-  Esp.  legal  entities 

■  Many  more  attributes  in  a  flow  record  can 
contribute  to  identifying  someone  "indirectly" 


E.  Boschi,  R.  Gramigna  FloCon  2008,  Savannah,  GA,  USA 


Principles:  legitimation  for  processing 


1.  Consent 

2. Data processing is "necessary for the performance
of a contract to which the data subject is a party"

3. 


■  Processing  must  be  limited  to  specified  purposes 

■  Further  processing  of  data  for  historical,  statistical 
or  scientific  purposes  is  possible  provided  that 
appropriate  safeguards  are  provided 

—  Left  to  national  laws 
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Principles: Information of the Subject


The subject must be informed about:

1. Identity of the data controller

2. Purpose of the processing

3. Other information, e.g. the recipient of the data.

■ It does not apply to scientific research, IF the
provision of such information

- proves impossible

- would involve a disproportionate effort

■ Appropriate safeguards must be provided
- Their specification is left to national law
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Border  Crossing 


■  Transfer  to  third  countries  is  generally  possible  if 
the  third  country  ensures  an  adequate  level  of 
protection 

http://ec.europa.eu/justice_home/fsj/privacy/thridcountries/index_en.htm

■ E.g.

✓ Switzerland, Canada, Argentina
✗ USA (except Safe Harbor)
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T raffic  data  and  location  data 


■ Introduced in Directive 2002/58/EC

- Traffic data: any data processed for the purpose of the
conveyance  of  a  communication  or  for  the  billing  thereof 

-  Location  data:  data  indicating  the  geographic  position  of 
the  terminal  equipment  of  a  user 

■  Objectives: 

-  Minimise  the  processing  of  personal  data 

-  Use  anonymous  or  pseudonymous  data  where  possible. 

■ "Anonymous" = it is no longer possible to identify
the  data  subject 
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Processing  of  Traffic  and  Location  Data 


■  Traffic  and  location  data  relating  to  subscribers  and 
users  must  be  erased  or  made  anonymous  when  no 
longer  needed 

■  The  processing  of  traffic  data  must  be  restricted 

-  To  persons  acting  under  authority  of  providers 

-  To  certain  activities  (e.g.  traffic  management,  fraud 
detection...) 


■  Location  data  can  be  processed  only  if 

-  There  is  consent,  or 

-  Data  is  made  anonymous 




The Role of Flow Data Anonymisation to Support Data Protection


■  The  well  known  problem: 

-  The  more  you  anonymise  the  better  privacy  is  protected... 

-  ...but  the  less  useful  the  data 

■  Anonymisation  aims  at  removing  sensitive  information 
referring  to  an  individual 

■ Attacks on anonymisation schemes have shown that
those schemes can be broken, allowing people to be
identified "indirectly".

■  Are  known  flow  anonymisation  techniques  effective  in 
protecting  the  privacy  of  individuals? 
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(4)  Anonymization  Techniques 


Field to be anonymized: IP address

IP               Truncation   Permutation    Black Marker   Prefix Preserving
135.98.111.17    135.98       141.2.32.37    10.1.1.1       22.131.88.67
135.98.111.128   135.98       41.12.96.67    10.1.1.1       22.131.88.157
135.98.132.37    135.98       142.72.8.5     10.1.1.1       22.131.201.29
141.161.3.3      141.161      21.33.4.1      10.1.1.1       12.192.32.51
141.72.8.5       141.72       11.14.96.118   10.1.1.1       12.78.201.97
32.53.48.1       32.53        12.161.3.3     10.1.1.1       31.197.3.82
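
The four techniques in the table can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tooling behind the slide: all function names and the key are hypothetical, the random lookup table stands in for whatever permutation a real tool would use, and the prefix-preserving variant uses an HMAC-based construction chosen for this example (each output bit is flipped or kept depending on a keyed function of the preceding input bits).

```python
import hashlib
import hmac
import ipaddress
import secrets


def truncate(ip, keep_octets=2):
    """Truncation: keep only the leading octets (135.98.111.17 -> 135.98)."""
    return ".".join(ip.split(".")[:keep_octets])


def black_marker(ip, marker="10.1.1.1"):
    """Black marker: replace every address with the same constant."""
    return marker


class Permutation:
    """Random one-to-one mapping, built lazily and reused for the whole trace."""

    def __init__(self):
        self._forward = {}
        self._used = set()

    def anonymise(self, ip):
        addr = int(ipaddress.IPv4Address(ip))
        if addr not in self._forward:
            candidate = secrets.randbelow(2**32)
            while candidate in self._used:  # retry until the value is unused
                candidate = secrets.randbelow(2**32)
            self._forward[addr] = candidate
            self._used.add(candidate)
        return str(ipaddress.IPv4Address(self._forward[addr]))


def prefix_preserving(ip, key):
    """Inputs sharing a k-bit prefix yield outputs sharing a k-bit prefix."""
    bits = format(int(ipaddress.IPv4Address(ip)), "032b")
    out = []
    for i, bit in enumerate(bits):
        # The flip decision for bit i depends only on the key and bits 0..i-1.
        digest = hmac.new(key, bits[:i].encode(), hashlib.sha256).digest()
        out.append(str(int(bit) ^ (digest[0] & 1)))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))
```

The sketch also shows the trade-off named earlier: truncation and black marker discard information outright, while the permutation and the prefix-preserving mapping keep a consistent per-address mapping that an attacker may try to recover.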
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Some  Anonymisation  Attack  Methods 


■ Data injection: injecting information to be logged with the
purpose of later recognizing that data in the anonymized trace

■ Fingerprinting: matching attributes of an anonymized object
against those of a known object (e.g. web server) to discover
a mapping between them

■ Semantic attacks: the system is exploited in such a way that the
victim believes it is doing one thing but is actually doing
something different. The attacker may infer part of the
unanonymized IP address by exploiting the semantics of
prefix preserving.

■ Structure recognition: recognizing structure between
anonymized and unanonymized objects
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Attacks  vs.  Anonymisation  Techniques 


Anonymisation techniques (columns): Prefix-preserving, Cryptographic approach, Truncation, Permutation

Attacks (rows), with the marks recorded for each row:

Semantic attack          ■ ■
Cryptographic attack     ■
Data Injection           ■ ■ ■
Fingerprinting           ■ ■ ■
Structure Recognition    ■ ■ ■

■ the attack can be used, (partial) results achieved
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Conclusions 


■  We  need  to  pay  attention  to  data  protection  laws 


■  Anonymisation  is  part  of  the  solution  to  protecting 
privacy,  but 

-  Research  is  still  needed 

-  This  is  not  only  a  technical  problem;  a  technical  solution 
alone  is  not  enough 

■  Legal  solutions,  policies,  guidelines,  interdisciplinary 
work  are  needed 

■ Anonymisation support is needed in standard flow
data export protocols such as IPFIX


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Automatic  anomaly 
detection  using  NfSen 


-  SURFnet  and  netflow  anomaly  detection 

-  NERD 

-  NfSen 

-  PeakFlow  SP 

-  Currently  used  methods 

- DDoS detection

-  Botnet  detection 

-  Holt-Winters  aberrant  behavior  detection 
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SURFnet 
and netflow anomaly detection


- NERD v1

-  Developed  by  TNO 

-  Based  on  cflowd 

-  cflowd  is  no  longer  supported 

-  NERD  v2 

-  Initially  developed  by  TNO 

-  Has  serious  performance  problems 

-  NfSen  can  do  the  same  but  without  the 
performance  problems 
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-  Netflow  Sensor  (NfSen)  is  a  network  statistics  tool 

-  Developed  by  Peter  Haag 

-  Currently  in  active  development 

-  Alert  plug-in  system 

-  Generic  plug-in  system 

-  Some  plug-ins  already  available 
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DDoS  detection 
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I  I  I  I  I  I  I  I  I  I  I  I  I  I  I  I  I  I  I  I  I  I  I 

detection 


- Simple flow analysis based on NERD v1 DDoS
detection:

-  Low  threshold 

-  High  threshold 

-  Rules  for  traffic  between  those  thresholds 

-  Custom  thresholds  for  high  load  services 
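
The threshold scheme above can be sketched as follows. This is a minimal illustration, not the plug-in's code: the threshold values, the example server address, and in particular the rule applied between the two thresholds (few bytes per flow) are assumptions made for this example, since the slides do not spell out the actual rules.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    low: int   # below this flow count the traffic is treated as expected
    high: int  # above this flow count the traffic is treated as suspicious


DEFAULT = Thresholds(low=1000, high=5000)                    # illustrative values
CUSTOM = {"192.0.2.10": Thresholds(low=20000, high=100000)}  # hypothetical high-load server


def classify(dst_ip, flows, byte_count):
    """Classify the traffic towards one destination in one 5-minute interval."""
    t = CUSTOM.get(dst_ip, DEFAULT)
    if flows < t.low:
        return "expected"
    if flows > t.high:
        return "suspicious"
    # Between the thresholds an extra rule decides. This rule is an assumption:
    # many flows carrying very few bytes each look more like a flood.
    bytes_per_flow = byte_count / max(flows, 1)
    return "suspicious" if bytes_per_flow < 100 else "expected"
```

Keeping the custom thresholds in a separate table means a busy mail or web server does not need to raise the thresholds for every other host.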
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Expected 


traffic 
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Definitively  Suspicuous 
Traffic 
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Flows 
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High  load  servers 
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Custom 


thresholds 
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DDos  interface 


report 
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DDos  interface:  Details 
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Top  10  flow 5  per  5  minutes  at  2007-08-21  13:56:05: 

flows  bytes   port usage

203    351310  min: 1042, max: 4990  2007-08-21 13:06:04  Report Portscan | analyse
168    298160  min: 1064, max: 4647  2007-08-21 13:11:04  Report Portscan | analyse
163    23621   min: 1103, max: 4990  2007-08-21 12:51:02  Report Portscan | analyse
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Botnet  detection 
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Botnet 


detection 


-  Hosts  infected  by  viruses  connect  to  hosts  known  as 
botnet  controllers 

- Lists of botnet controllers are available, for example:


-  Our  plug-in  logs  all  hosts  that  connect  to  known  botnet 
controllers 

-  Automatically  reports  to  incident  report  system  using 
IODEF 
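
The matching step described above amounts to a lookup of each flow's destination in the controller list. A minimal sketch, assuming flows are available as (source, destination) address pairs; the function name and input format are hypothetical, and the IODEF reporting step is not shown.

```python
import ipaddress


def infected_hosts(flows, controllers):
    """Map each source host to the known botnet controllers it contacted.

    flows: iterable of (src_ip, dst_ip) string pairs
    controllers: iterable of controller IP address strings
    """
    known = {ipaddress.ip_address(c) for c in controllers}
    hits = {}
    for src, dst in flows:
        if ipaddress.ip_address(dst) in known:
            hits.setdefault(src, set()).add(dst)
    return hits
```

Each entry in the result corresponds to one host that would be logged and reported.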
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Botnet  IODEF  reports 


[Figure: an IODEF report generated by the plug-in, overlaid with a screenshot of the incident management interface (incident details SURFcert#019038).]

Recoverable structure of the IODEF document (XML version 1.0, encoding iso-8859-1, lang="en", namespace urn:ietf:params:xml:ns:iodef-1.0):

io:IODEF-Document
  io:Incident
    io:IncidentID
    io:StartTime, io:EndTime, io:ReportTime
    io:Assessment
      io:Impact
    io:Contact
      io:ContactName
    io:EventData
      io:Method
        io:Reference
      io:Flow
        io:System
          io:Node
            io:Address
        io:System
          io:Node
            io:Address
          io:Service
            io:Port
    io:AdditionalData (NFSen)

Recoverable fields of the incident screenshot:

Incident type: infected
Incident state: inspection requested
Incident status: open
Source (ip): 192.168.1.1
Target (ip:port): 192.168.1.2
Packet (type:count): flow:23
Start time: 2007-08-13T15:07:47+02:00
End time: 2007-08-13T21:06:12+02:00
Affected IP addresses: 192.168.1.1, machine name infected.host, constituency utwente.nl, role in incident Unknown
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Holt-Winter  abarrent 
behavior  detection 
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Holt-Winters  aberrant 
behavior  detection 


- Uses the periodicity of the data to predict expected values and
detect aberrant behavior.

[Graph: Flows/s, any protocol, Thu Aug 16 2007]
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Winters 


Example 


[Graph: Flows/s, any protocol, Tue Aug 7 2007, Tue 00:00 to Wed 00:00; series: Trillian, Arthur, Zaphod, Ford]
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Holt-Winters: 

Original  implementation 


Trend 


Periodic  information  Expected  Noise 


Prediction 
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Limitations  of  the 
original  implementation 


The  original  algorithm  has  three  parameters: 

-  One  that  defines  the  weight  of  historical  data 

-  One  that  defines  the  weight  of  the  trend 

-  One  that  defines  the  amount  of  expected  noise 

The  original  algorithm  has  a  constant  learning  rate 

-  With  a  low  learning  rate,  the  selection  of  the  initial  values  is  critical.  This  will 
introduce  false  positives  for  a  long  time 

-  With  a  high  learning  rate,  the  model  will  likely  be  overfitted.  This  will 
introduce  false  negatives 

The  trend  parameter  has  no  significant  influence  with 
the  resolution  we  are  using 
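
For reference, a constant-learning-rate Holt-Winters detector can be sketched as below. This is a minimal illustration, assuming the commonly described additive form with a per-slot deviation band; it is not the implementation discussed in the talk. The parameter names (alpha, beta, gamma, delta), the initialisation from the first period, and the warm-up handling are choices made for this example and map only loosely onto the three parameters listed above.

```python
def holt_winters_aberrations(series, period, alpha=0.1, beta=0.01,
                             gamma=0.1, delta=3.0, warmup_periods=2):
    """Return the indexes whose value falls outside the predicted band."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full periods of data")
    # Initial state taken from the first period (an assumption of this sketch).
    level = sum(series[:period]) / period
    trend = 0.0
    season = [y - level for y in series[:period]]
    deviation = [0.0] * period
    aberrant = []
    for t in range(period, len(series)):
        slot = t % period
        y = series[t]
        prediction = level + trend + season[slot]
        error = abs(y - prediction)
        # Flag only after the warm-up, once the deviation band has had input.
        if t >= warmup_periods * period and error > delta * deviation[slot]:
            aberrant.append(t)
        previous_level = level
        level = alpha * (y - season[slot]) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        season[slot] = gamma * (y - level) + (1 - gamma) * season[slot]
        deviation[slot] = gamma * error + (1 - gamma) * deviation[slot]
    return aberrant
```

With 5-minute bins and a daily pattern, `period` would be 288. Every rate is fixed for the whole run, which is the limitation described above: the quality of the initial state and the size of the rates trade off against each other.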
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Holt-Winters: 
Multiple  trends 


Network  traffic  time  series  often  show  multiple 
recurring  patterns 
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Holt-Winters: 
Multiple  periods 
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Learning 


[Figure: prediction per iteration for a fixed and an adaptive learning rate]

Fixed learning rate:
The first pattern is overweighted

Adaptive learning rate:
The weight of the first pattern
is relative to the rest
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Real 


data  example 
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Bits/s  proto  ICMP 


Holt-Winters:
Usage Example

[Graph: Bits/s, proto ICMP, Thu Jul 26 19:45:00 2007; series: Trillian, Arthur, Zaphod, Ford]

Normal ICMP Traffic

Aberrant ICMP Traffic:
Caused by DDoS attack
by Stormworm
botnet

[Graph: Bits/s, proto ICMP, Mon Jul 23 19:45:00 2007; series: Trillian, Arthur, Zaphod, Ford]
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Rows/s  any  protocol 


Holt-Winters:
Other possible uses

[Graph: Flows/s, any protocol, Tue Aug 14 02:25:00 2007, Wed through Tue; series: SMTP]

Common SMTP Traffic

Last week SMTP Traffic
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-  Simple  netflow  analysis  can  be  a  great  added  value  to  network 
security  for  ISP's 

-  Holt-Winters  analysis  has  some  quirks: 

-  Parameters  are  hard  to  understand  and  hard  to  choose 

-  The  trend  parameter  (mostly)  has  no  significant  value  in  network 
analysis 

-  Ignoring  other  patterns  decreases  accuracy 

- But an adapted version can improve this:

-  A  flexible  learning  rate  can  simplify  the  selection  of  the  parameters 

-  The  trend  parameter  can  be  removed 

-  We  can  look  at  more  than  1  pattern 
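
The effect of a flexible learning rate can be illustrated with a seasonal profile that is learned slot by slot. This is a sketch of one possible reading of the idea, not the adapted algorithm itself: here the n-th observation of a slot is weighted 1/n (so early patterns count equally) until the rate reaches a floor, and no trend term is kept. The function names and the `min_rate` floor are assumptions.

```python
def fixed_rate_profile(series, period, rate):
    """Constant learning rate, seeded with the first period."""
    profile = [float(y) for y in series[:period]]
    for t in range(period, len(series)):
        slot = t % period
        profile[slot] += rate * (series[t] - profile[slot])
    return profile


def adaptive_rate_profile(series, period, min_rate=0.05):
    """Learning rate 1/n for the n-th observation of a slot, floored at min_rate."""
    profile = [0.0] * period
    seen = [0] * period
    for t, y in enumerate(series):
        slot = t % period
        seen[slot] += 1
        rate = max(1.0 / seen[slot], min_rate)
        profile[slot] += rate * (y - profile[slot])
    return profile
```

For the two-period series `[10, 20, 30, 40]` with `period=2`, the fixed rate 0.25 leaves the profile close to the first period, while the adaptive rate yields the mean of both periods. To follow more than one pattern, one such profile could be kept per period (for example daily and weekly); how they are combined is left open here.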
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-  DDoS  plugin: 


-  Botnet  and  Holt-Winters  plugins: 

-  In  development,  but  contact  me  if  you  want  to  try  it: 
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Wim  Biemolt 
Wim.Biemolt@surfnet.nl 


Werner  Schram 
Werner.Schram@surfnet.nl 
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REDJACK 


AMP-Based  Flow 
Collection 


Greg  Virgin  -  RedJack 




AMP-Based Flow Collection

•  AMP  -  “Analytic  Metadata  Producer”:  Patented  US  Government 
flow  /  metadata  producer 

•  AMP  generates  data  including 

•  Flows 

•  Host  metadata  (TCP  stack  information,  software  banners) 

•  Metrics 

•  Purpose  of  this  talk:  To  discuss  the  flow  data  collection 
implications  of  these  additional  data  types  for  forensic  analysis 
(not  just  correlation  and  alerting) 

•  Additional  data  sources 

•  Analysis  scenarios 

•  Collection  schemes 




Additional  Data  Sources 

•  Core  data  source:  flow  data 

•  Netflow-like  data  with  additional  TCP  flag  information 

•  Flow-derived  data  sources:  port  details 

•  Ports  accepting  connections 

•  Bandwidth  statistics 

• Additional data sources (not appropriate for flow records: these are
aggregated by IP address, not by communication)

•  TCP  Stack  information  reflecting  running  O/S 

•  Server  Banners  (as  seen  by  the  Internet) 

•  Client  Banners  (as  sent  to  the  Internet) 

•  DNS  Names  collected  from  both  the  DNS  protocol  and  other  protocols 
(NEVER  trust  DNS!) 

•  Search  strings  from  search  engines  (HTTP  “referer”  tags) 



Scenario 1: Server “Importance”

•  Server  Profile 

•  Configuration  (“Windows  2000”) 

•  List  of  listening  ports  (80,  443) 

•  List  of  available  services  (“IIS/6”) 

•  Domain  name(s)  (“www.golfcarts.com”) 

•  Traffic  Volume  (X  connections  today,  per  week,  per  month) 

•  Associated  search  strings  (“golf  carts”,  “high  performance  golf  carts”) 

•  Why? 

•  Provides  metrics  to  automatically  partition  servers  by  volume,  type,  vulnerability 

•  Provides  forensic  value  through  server  details  often  unavailable  at  time  of  analysis 

•  Flow  analysis  scenarios: 

•  Which  active  servers  were  impacted  by  flow  traffic  /  scans  /  attacks 

•  Scrutinize  payload-bearing  traffic  going  to  these  servers 

•  Make  sure  you’re  not  picking  up  potentially  “normal”  activity  in  other  anomaly  detection 
approaches  (your  concept  of  normal  doesn’t  necessarily  have  to  be  perfect) 

•  Assign  real  world  concepts  to  traffic  activity  and  perform  sanity  checks  through  search 
strings 



Scenario  2:  DNS  /  Name  Analysis 

•  Naming  Information: 

•  DNS  Response  packets 

•  HTTP  Get  requests,  mail  protocol  name  announcements 

•  Why? 

•  The  current  DNS  implementation  presents  major  risks  because  threats  can 
masquerade  as  well  known  sites 

•  The  web  protocol  is  dominated  by  virtual  servers 

•  We  have  found  interesting  discrepancies  between  DNS  and  naming  in  other  protocols 

•  Dealing  with  hosts  as  domain  names  is  more  natural  (the  purpose  of  the  protocol) 

•  Flow  analysis  scenarios: 

•  Name-based  queries  (possible  with  SiLK) 

•  Names  or  name  checksums  incorporated  into  flow  records  for  web  traffic,  followed  by 
correlation  with  a  name  for  the  IP  once  the  data  is  collected  (helps  with  virtual  servers) 

•  Forensic  analysis  of  traffic  to  or  from  bogus  domain  names  to  determine  potential 
damage  (but  you  have  to  do  the  above  correlation  first) 
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Scenario  3:  Making  IP  Space 
Heterogeneous 

•  Required  data: 

•  Host  Configuration 

•  listening  ports 

•  running  services 

•  Why? 

•  Too  often  IP  space  is  considered  one  big  homogeneous  blob  -  analysis  is  done  on 
traffic  between  nodes  without  considering  types  of  nodes 

•  The  diagnosis  of  activities  such  as  worms  can  be  made  from  hosts  in  a  set  running  the 
same  piece  of  software  rather  than  signature 

•  Flow  analysis  scenarios: 

•  What  has  been  called  a  “similarity”  analysis:  take  an  IP  set  and  run  it  against  host 
profiles  to  provide  statistics  on  what  the  hosts  in  the  set  have  in  common 

•  Flow  analysis  broken  down  by  host  attributes  isn’t  very  common,  so  there  are  a  number 
of  possibilities 
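
The “similarity” analysis described above can be sketched as a count over host profiles. This is a minimal illustration: the profile layout (a dictionary of attribute lists per IP address) and the function name are assumptions, not the AMP data format.

```python
from collections import Counter


def similarity(ip_set, profiles):
    """Share of profiled hosts in ip_set that have each (attribute, value)."""
    counts = Counter()
    profiled = 0
    for ip in ip_set:
        profile = profiles.get(ip)
        if profile is None:
            continue  # no host information collected for this address
        profiled += 1
        for attribute, values in profile.items():
            for value in set(values):
                counts[(attribute, value)] += 1
    if profiled == 0:
        return {}
    return {key: n / profiled for key, n in counts.items()}
```

A value close to 1.0 for, say, one running service points to what the hosts in the set have in common, which is the kind of evidence the worm example above relies on.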



Scenario 4: The “Alternate Use” Flag

•  Marking  flows  for  statistically  significant  attributes  is  marking  flows  based  on 
signatures,  not  necessarily  “new”  data 

• “Alternate Use” refers to using an Internet protocol properly, but not
for the purpose of the protocol (this is not protocol analysis)

•  Why? 

•  This  type  of  traffic  can  be  a  huge  portion  of  the  traffic 

•  Of  unique  DNS  names  seen  by  your  network,  more  than  half  of  them  may  come  from 
just  a  handful  of  sources 

•  Flow  analysis  scenarios: 

•  Often  port  and  protocol  numbers  are  considered  synonymous  with  legitimate  use  of 
protocols;  this  can  be  used  to  filter  out  alternate  uses 

• Most of the “alternate” uses for DNS appear to be spam reporting; that information
could be harvested
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Scenario  5:  IDS  Verification 


•  Use  host  information  or  flow  data  to  validate  IDS  records 

•  If  hosts  aren’t  running  the  software  that  IDS  signatures  think  they  are. . . 


•  Not  a  new  concept  and  done  in  practice 
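
A check of this kind can be sketched as a comparison between an IDS alert and the host profile of its target. The alert fields, the profile layout, and the returned labels are all hypothetical and only illustrate the idea.

```python
def verify_alert(alert, profiles):
    """Compare what an IDS alert assumes with what is known about the target."""
    profile = profiles.get(alert["dst_ip"])
    if profile is None:
        return "unknown host"
    if alert["dst_port"] not in profile["ports"]:
        return "port not listening"
    if alert["targets"] not in profile["services"]:
        return "software mismatch"
    return "plausible"
```

An alert for an exploit against one web server product, raised on a host whose banner shows a different product, would come back as a software mismatch and could be given lower priority.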
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Summary  of  Scenarios 

•  New  data  sources  can  be  used  with  flow  data  to: 

•  Add  contextual  information  and  increase  situational  awareness 

•  Create  filters  that  could  be  useful  for  both  queries  and  data  collection 

•  Partition  data  into  bins  or  streams  with  more  (or  less)  analytic  meaning 

•  The  best  result  is  for  these  techniques  to  impact  the  data  or  be  recorded  as 
additional  data 

•  This  has  an  obvious  impact  on  collection  infrastructure 

•  Data  production  software  should  be  able  to  mark,  reformat,  or  drop  flow  data  based  on 
this  information 

•  Data  collection  and  storage  software  should  be  able  to  process  or  partition  this 
information 

•  Since  most  of  these  techniques  don’t  amount  to  much  more  than  a  filter  definition,  a 
registry  for  these  filters  that  different  parts  of  the  flow  collection  infrastructure  can  use  is 
appropriate 
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New  Sensor  Attributes 

•  (This  is  in  addition  to  flows  with  TCP  options,  host  information,  and  DNS) 

•  Filters  based  on  additional  information 

•  Domain  name  value  for  the  web  protocol 

•  “Alternate  Use”  flag 

•  Not  yet  discussed: 

•  Change  ICMP  to  include  third  IP  address  in  some  instances 
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New  Data  Collection  Attributes 

•  Marking  or  partitioning  flows  with  domain  names 

•  Metrics,  filtering,  and  additional  aggregation  (flows  for  large  servers  can  be 
compacted) 
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New  Data  Store  Attributes 


•  Flow  data  closely  tied  to  new  data  sources 


•  Registry  for  filtering  techniques  that  can  be  leveraged  by  the  sensor  and 
collection 


•  Questions? 

•  Greg  Virgin,  greg.virgin@redjack.com 
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Problem  Statement 

■  Trade-off  in  network  traffic  information  collection  for  incident  analysis 

Raw  packet  traces:  finest  level  of  detail  but  impractical  to  manage  and  search 
Flow  traces:  high-level  traffic  abstraction  but  aggregated 

■  Traditional  flow  exports  may  not  provide  traffic  details  required 

to  understand  causes  of  incidents 

Missing  layer  3  and  layer  4  header  information 
No  packet  content  information 

■  Flow-level  information  is  still  a  considerable  amount  of  data 

Flow  record  collections  are  still  tedious  to  search,  store,  and  analyze 
Majority  of  this  (raw)  information  is  never  accessed 
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Objectives  and  Goals 

Extend  a  collector  system  to  provide  more  accurate  incident  analysis 
Adapt  information  granularity  depending  on  relevance  of  the  traffic: 

Focus  in  on  particular  traffic  events  to  obtain  more  details 

Compress  known/less  relevant  traffic  events  (conserve  a  meaningful  abstraction) 


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Increasing  Traffic  Information  Granularity 

■  Problem 

Collecting  detailed  traffic  information  is  cumbersome 

Fixed  and  limited  amount  of  information  in  default  flow  exports  (e.g.,  NetFlow  v5) 
Valuable  information  may  have  been  lost  along  with  flow  aggregation 


■  Traditional  approach  (on-going  anomaly) 

Physically  attach  a  probe  or  packet  dumping  device  at  router  (e.g.,  tcpdump  with  filtering) 
Collection  of  rigid  traffic  information  (e.g.,  entire  packets):  complex  analysis 


How to simplify data collection? Create Zoom Monitors!

Dynamically  controlled  collection  of  traffic  information  at  desired  level  of  detail 
Central  management  console  for  coordination 

Make  use  of  capabilities  of  network  device  inventory  (routers,  switches):  reporting/dumping 
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Zoom  Monitors 


■  Specification 

Metering  point  and  collector  device 
Zoom  monitor  lifespan 
Filter  criteria 

Traffic  aspects  to  be  exported 

■  Export  collection  and  display 


Collector  device 


Reconfigure  metering  device  to  create  specific  exports 
Prepare  collector  device  to  store  exported  traffic  information 
Centralized  management  and  display 


■  Examples 

Show  me  the  payload  of  all  DNS  requests  of  host  10.3.4.5  during  the  next  10  minutes 
Look  for  all  internal  hosts  scanning  on  TCP  service  port  9996  (e.g.,  candidate  worm  traffic) 
Inspect  GET/POST  requests  and  virtual  servers  accessed  on  web  server  10.4.5.6 
Export  unsampled  flow  measurements  from  subnet  10.9.3.1/24 
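
The four parts of a zoom monitor specification can be captured in a small record. The sketch below encodes the first example (DNS requests of host 10.3.4.5 for the next 10 minutes). It is an illustration only: the class, the metering point and collector names, and the export field names are assumptions, and the filter is written in BPF syntax because that is the filter syntax mentioned for the software meter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ZoomMonitor:
    name: str
    metering_point: str   # where the traffic is metered
    collector: str        # where the exports are sent
    bpf_filter: str       # filter criteria
    export_fields: tuple  # traffic aspects to be exported
    start: datetime
    lifespan: timedelta

    def is_active(self, now):
        """True while the monitor's lifespan has not expired."""
        return self.start <= now < self.start + self.lifespan


def dns_payload_monitor(host, start):
    """Zoom monitor for the DNS requests of one host during 10 minutes."""
    return ZoomMonitor(
        name=f"dns-payload-{host}",
        metering_point="router-1/FastEthernet1/0",   # hypothetical
        collector="udp://collector.example:2095",    # hypothetical
        bpf_filter=f"src host {host} and udp dst port 53",
        export_fields=("sourceIPv4Address", "destinationIPv4Address", "payload"),
        start=start,
        lifespan=timedelta(minutes=10),
    )
```

A management console could store such records, push the filter and export template to the metering device, and remove the monitor once `is_active` returns false.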
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Decreasing  Traffic  Information  Granularity 



■  Problem 

Most  stored  traffic  information  is  irrelevant  for  incident  analysis  (never  accessed) 
Redundancy  (limited  value):  Increased  storage  overhead  and  search  complexity 


■  Traditional  approaches 

Rolling  database  (FIFO):  keep  all  records  up  to  a  limit  (e.g.,  #records,  age):  information  removal 
Uniform  summarization:  adapt  resolution  of  information  (hourly,  daily,  weekly) 

Keep  top-k  entries  (according  to  some  aspect) 

■  How  can  we  do  better? 

Majority  of  network  events  is  known  or  recurring 

Gradually  compress  information  of  irrelevant  traffic  events  in  a  lossy  fashion 
With  minimal  impact  on  incident  analysis  tasks 
Summarize  similar  events  (coarse-grained  representation) 
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Observations 


[Figure legend: exported flow record; inactive/active timeouts]

■ Flow exports

Multiple exports for a single connection

Examples:

Long-lived connections (streams, remote sessions, etc.)
Timeouts on routers (inactive/active timeout)

■ Bi-directionality

Most flows have a reversed counterpart

■ Information similarity

Sets of records with limited added value on the flow level

Groups of flows with similar properties
(Web, mail, printer traffic, polling)

Uniqueness: ephemeral port, time stamps,
byte and packet counters
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Compression  Model1 


                  Flow record       Flow              Conversation     Session
Raw exports       Yes               No                No               No
Flow definition   Yes               Yes               Yes              No (subset thereof)
Direction         Uni-directional   Uni-directional   Bi-directional   Bi-directional
# Flow records    1                 > 1               > 1              > 1
# Flows           1                 1                 1 or 2           > 1 or > 2
# Conversations   1                 1                 1                > 1

1) without prior knowledge such as domain or application specific information
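
The successive levels of the compression model can be sketched as three grouping steps. This is a minimal illustration, not the collector's implementation: the record layout, the helper names, and the rule that the lower port number is the service port are assumptions, and timeouts are ignored.

```python
from collections import defaultdict


def _new_table():
    return defaultdict(lambda: {"packets": 0, "bytes": 0})


def _add(table, key, counters):
    """Accumulate packet and byte counters under the given key."""
    table[key]["packets"] += counters["packets"]
    table[key]["bytes"] += counters["bytes"]


def to_flows(records):
    """Flow records -> flows: merge repeated exports of the same 5-tuple."""
    flows = _new_table()
    for r in records:
        _add(flows, (r["src"], r["sport"], r["dst"], r["dport"], r["proto"]), r)
    return dict(flows)


def to_conversations(flows):
    """Flows -> conversations: merge a flow with its reversed counterpart."""
    conversations = _new_table()
    for (src, sport, dst, dport, proto), counters in flows.items():
        end_a, end_b = sorted([(src, sport), (dst, dport)])
        _add(conversations, (end_a, end_b, proto), counters)
    return dict(conversations)


def to_sessions(conversations):
    """Conversations -> sessions: drop the ephemeral port from the key."""
    sessions = _new_table()
    for ((ip_a, port_a), (ip_b, port_b), proto), counters in conversations.items():
        # Assumption: the lower port number identifies the service.
        if port_a <= port_b:
            key = (ip_b, ip_a, port_a, proto)  # (client, server, service port, proto)
        else:
            key = (ip_a, ip_b, port_b, proto)
        _add(sessions, key, counters)
    return dict(sessions)
```

Dividing the number of input records by the number of entries at each level gives compression ratios of the kind reported in the results.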
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Implementation 

■  Metering  device  configuration  for  Zoom  Monitors 

Reconfiguration  of  metering  devices 
Management  console 

■  Export  collector 

Collection  and  storage 
Traffic  information  compression 
Data  querying 
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Metering  Device  Configuration 

■  Technologies 

- Cisco IOS Flexible NetFlow (FNF)

Configuration of multiple customized monitors

Currently: input filtering for FNF monitors not available (input filters needed at collector)

- Hespera Traffic Meter (IBM Research)

Software-based flow monitor supporting NetFlow v5 and v9, IETF IPFIX exports
Customized flow exports (variable templates), CLI-based reconfiguration
Filtering with BPF filter syntax


■  User-based  creation  of  dynamic  zoom  monitors 

Web-based  specification  of  zoom  monitors 

Deployment  on  metering  device  (CLI-based)  and  management  (e.g.,  lifespan) 
Future:  XML-based  configuration  (cf.  [Dimitropoulos/Kind]  or  [NetConf]) 
Registering  the  zoom  monitor  at  collector  device  (for  disambiguation/triage) 
Pre-defined  zoom  monitor  templates  from  library 
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Export  Collector 

■  Prototype  based  on  the  Aurora  flow  analyzing  system  (IBM  Research) 

Replaced  existing  Aggregation  Database  (ADB)  with  PostgreSQL  (PG)  backend 
Input  triage  according  to  zoom  monitors 

Incremental  population/gradually  remove  detailed  representation:  keep  “Session” 


Aurora  project 


reconfiguration 
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Create  New  Zoom  Monitor 

[Screenshot: web form for creating a new zoom monitor. Recoverable sections and fields:]

Zoom Monitor: Name, Description
Filter: IPv4 Information / Destination Address; IPv4 Transport / TCP / Destination port 80
Load existing template: Destination address, Destination prefix, Empty template
Export template: IPv4 Information / Source Address (key field); IPv4 Information / Protocol (key field); IPv4 Information / Section 340
Router and Interface: Router (.zurich.ibm.com), Interface (FastEthernet1/0), Direction (input)
Zoom monitor lifespan: Ad-hoc zoom monitor (start now, Duration) or Specify start and end time
Metering cache: Type, Entries, Active timeout (30 min), Inactive timeout (10 sec), each with a default option
Flow Exporter/Collector: Configured collector (udp://:2095) or Create new collector
Actions: Save as template, Create zoom monitor
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Zoom  Results:  Sessions 

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Zoom  Results:  Zoom  Monitor  'Payload  Section1 
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Results:  Compression  (WAN  traffic) 


Nb  of  records  in  per  bin 
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CO 


Session  inactive  timeout:  20min 


Average compression ratio

#flow records : #flows          1.26   σ = 0.07
#flow records : #conversations  2.34   σ = 0.28
#flow records : #sessions       22.80  σ = 7.00


Nb  of  records  in  DB 
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Traffic  Collection  for  Incident  Analysis 


After-the-fact analysis:
Initial guess → Query collected data in DB → Refine assumptions → Reproduce event trail → Understand/infer causes → Conclude

Real-time analysis:
Future incident trap → Understand/infer causes → Conclude
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Future  Work  and  Vision 

■  Automated  zoom  monitor  creation 

Interface  to  a  behavior-based  network  anomaly  detection  system 

Proactive  collection  of  evidence  for  off-line  forensic  analysis  of  abnormal  events 


■  Distributed  collector  infrastructure 

Distributed  collectors,  e.g.,  at  multiple  sites  (scalability) 

Transfer  required  information  to  central  reporting  system  on  demand 


■  Cisco  IOS  Flexible  NetFlow  with  input  filters 

Perform  filtering  on  routers  to  replace  software-based  metering  (and  filtering) 
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Conclusion 

■  Incident  analysis  tool  adapting  flow  information  granularity 

Increase  level  of  detail  of  relevant/unknown  traffic  events 
Decrease  level  of  detail  (lossy  compression)  of  less  relevant  events 
Keep  a  meaningful  abstraction  of  all  traffic  events 


■  Creation  of  customized  zoom  monitors 

Zoom  in  on  specific  traffic  to  gain  additional  information  about  its  properties  and  behavior 
Centralized  management  of  metering  devices  for  traffic  detail  collection 
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Problem  Statement 

Trade-off  in  network  traffic  information  collection  for  incident  analysis 

Raw  packet  traces:  finest  level  of  detail  but  impractical  to  manage  and  search 
Flow  traces:  high-level  traffic  abstraction  but  aggregated 

Traditional  flow  exports  may  not  provide  traffic  details  required 

to  understand  causes  of  incidents 

Missing  layer  3  and  layer  4  header  information 
No  packet  content  information 

■  Flow-level  information  is  still  a  considerable  amount  of  data 

Flow  record  collections  are  still  tedious  to  search,  store,  and  analyze 
Majority  of  this  (raw)  information  is  never  accessed 
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Objectives  and  Goals 

Extend  a  collector  system  to  provide  more  accurate  incident  analysis 
Adapt  information  granularity  depending  on  relevance  of  the  traffic: 

Focus  in  on  particular  traffic  events  to  obtain  more  details 

Compress  known/less  relevant  traffic  events  (conserve  a  meaningful  abstraction) 
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Increasing  Traffic  Information  Granularity 

Problem 

Collecting  detailed  traffic  information  is  cumbersome 

Fixed  and  limited  amount  of  information  in  default  flow  exports  (e.g.,  NetFlow  v5) 
Valuable  information  may  have  been  lost  along  with  flow  aggregation 


Traditional  approach  (on-going  anomaly) 

Physically  attach  a  probe  or  packet  dumping  device  at  router  (e.g.,  tcpdump  with  filtering) 
Collection  of  rigid  traffic  information  (e.g.,  entire  packets):  complex  analysis 


How to simplify data collection? Create Zoom Monitors!

Dynamically  controlled  collection  of  traffic  information  at  desired  level  of  detail 
Central  management  console  for  coordination 

Make  use  of  capabilities  of  network  device  inventory  (routers,  switches):  reporting/dumping 
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Zoom  Monitors 


Specification 

Metering  point  and  collector  device 
Zoom  monitor  lifespan 
Filter  criteria 

Traffic aspects to be exported


Export  collection  and  display 


Collector  device 


Metering  device 


Reconfigure  metering  device  to  create  specific  exports 
Prepare  collector  device  to  store  exported  traffic  information 
Centralized  management  and  display 


Examples 

Show  me  the  payload  of  all  DNS  requests  of  host  10.3.4.5  during  the  next  10  minutes 
Look  for  all  internal  hosts  scanning  on  TCP  service  port  9996  (e.g.,  candidate  worm  traffic) 
Inspect  GET/POST  requests  and  virtual  servers  accessed  on  web  server  10.4.5.6 
Export  unsampled  flow  measurements  from  subnet  10.9.3.1/24 
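
As an illustration of the specification items listed above (metering point, collector, lifespan, filter criteria, exported traffic aspects), here is a minimal Python sketch of the first example. All class and field names, the router/interface label, and the collector address are hypothetical and not part of the presented system; the filter string uses ordinary BPF syntax.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ZoomMonitorSpec:
    """Hypothetical zoom monitor specification (all field names are assumptions)."""
    name: str
    metering_point: str    # router/interface on which traffic is metered
    collector: str         # destination for the exported records
    bpf_filter: str        # filter criteria in BPF syntax
    export_fields: list    # traffic aspects to be exported
    start: datetime
    lifespan: timedelta

    def expires_at(self) -> datetime:
        """Time at which the monitor should be removed again."""
        return self.start + self.lifespan

    def is_active(self, now: datetime) -> bool:
        """True while the monitor is within its lifespan."""
        return self.start <= now < self.expires_at()


# "Show me the payload of all DNS requests of host 10.3.4.5
#  during the next 10 minutes"
dns_monitor = ZoomMonitorSpec(
    name="dns-requests-10.3.4.5",
    metering_point="router1:FastEthernet1/0",   # assumed label
    collector="udp://collector.example:2095",   # assumed address
    bpf_filter="src host 10.3.4.5 and udp dst port 53",
    export_fields=["source address", "destination address", "payload section"],
    start=datetime(2007, 11, 20, 10, 0, 0),
    lifespan=timedelta(minutes=10),
)
```

A management console could periodically call `is_active` to decide when to remove the monitor from the metering device.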
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Decreasing  Traffic  Information  Granularity 


Problem 

Most  stored  traffic  information  is  irrelevant  for  incident  analysis  (never  accessed) 

Redundancy  (limited  value):  Increased  storage  overhead  and  search  complexity 

Traditional  approaches 

Rolling  database  (FIFO):  keep  all  records  up  to  a  limit  (e.g.,  #records,  age):  information  removal 
Uniform  summarization:  adapt  resolution  of  information  (hourly,  daily,  weekly) 

Keep  top-k  entries  (according  to  some  aspect) 

■  How  can  we  do  better? 

The majority of network events are known or recurring

Gradually  compress  information  of  irrelevant  traffic  events  in  a  lossy  fashion 
With  minimal  impact  on  incident  analysis  tasks 
Summarize  similar  events  (coarse-grained  representation) 
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Observations 


[Figure: flow exports over time; markers denote exported flow records and inactive/active timeouts]

Multiple  exports  for  a  single  connection 

Examples: 


Long-lived  connections  (streams,  remote  sessions,  etc.) 
Timeouts  on  routers  (inactive/active  timeout) 


Bi-directionality 


Most  flows  have  a  reversed  counterpart 




Information  similarity 

Sets  of  records  with  limited  added  value  on  the  flow  level 

Groups  of  flows  with  similar  properties 
(Web,  mail,  printer  traffic,  polling) 

Uniqueness:  ephemeral  port,  time  stamps, 
byte  and  packet  counters 
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Compression  Model1 


Abstraction models

                    Flow record      Flow             Conversation    Session
Raw exports         Yes              No               No              No
Flow definition     Yes              Yes              Yes             No (subset thereof)
Direction           Uni-directional  Uni-directional  Bi-directional  Bi-directional
# Flow records      1                > 1              > 1             > 1
# Flows             1                1                1 or 2          > 1 or > 2
# Conversations     1                1                1               > 1

1 without prior knowledge such as domain or application specific information
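
The abstraction levels can be sketched as grouping keys over flow records. The Python sketch below is an assumption-laden illustration, not the Aurora implementation: the record layout is invented, and the session key guesses that the endpoint with the lower port number is the server and that the ephemeral client port is dropped.

```python
from collections import namedtuple

# Hypothetical record layout; real exports carry more fields.
FlowRecord = namedtuple("FlowRecord", "src sport dst dport proto")


def flow_key(r):
    """Flow: all exports sharing the same uni-directional 5-tuple."""
    return (r.src, r.sport, r.dst, r.dport, r.proto)


def conversation_key(r):
    """Conversation: a flow merged with its reversed counterpart."""
    a, b = (r.src, r.sport), (r.dst, r.dport)
    return (min(a, b), max(a, b), r.proto)


def session_key(r):
    """Session: conversations between one client and one service.

    Assumption: the lower port number identifies the server side,
    and the ephemeral client port is ignored.
    """
    if r.dport <= r.sport:
        return (r.src, r.dst, r.dport, r.proto)
    return (r.dst, r.src, r.sport, r.proto)


def compression_ratio(records, key):
    """Number of flow records per distinct group under the given key."""
    return len(records) / len({key(r) for r in records})


records = [
    FlowRecord("10.0.0.1", 40001, "10.0.0.2", 80, "TCP"),  # first export
    FlowRecord("10.0.0.1", 40001, "10.0.0.2", 80, "TCP"),  # export after timeout
    FlowRecord("10.0.0.2", 80, "10.0.0.1", 40001, "TCP"),  # reverse direction
    FlowRecord("10.0.0.1", 40002, "10.0.0.2", 80, "TCP"),
    FlowRecord("10.0.0.2", 80, "10.0.0.1", 40002, "TCP"),
    FlowRecord("10.0.0.1", 40003, "10.0.0.2", 80, "TCP"),
]
```

With these six made-up records there are five flows, three conversations and one session, so the ratio grows as the abstraction becomes coarser.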
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Implementation 

Metering  device  configuration  for  Zoom  Monitors 

Reconfiguration  of  metering  devices 
Management  console 

■  Export  collector 

Collection  and  storage 
Traffic  information  compression 
Data  querying 
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Metering  Device  Configuration 

Technologies 

Cisco  IOS  Flexible  NetFlow  (FNF) 

Configuration  of  multiple  customized  monitors 

Currently:  input  filtering  for  FNF  monitors  not  available  (input  filters  needed  at  collector) 
Hespera  Traffic  Meter  (IBM  Research) 

Software-based flow monitor supporting NetFlow v5 and v9, IETF IPFIX exports
Customized  flow  exports  (variable  templates),  CLI-based  reconfiguration 
Filtering  with  BPF  filter  syntax 


User-based  creation  of  dynamic  zoom  monitors 

Web-based  specification  of  zoom  monitors 

Deployment  on  metering  device  (CLI-based)  and  management  (e.g.,  lifespan) 
Future:  XML-based  configuration  (cf.  [Dimitropoulos/Kind]  or  [NetConf]) 
Registering  the  zoom  monitor  at  collector  device  (for  disambiguation/triage) 
Pre-defined  zoom  monitor  templates  from  library 
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Export  Collector 


Prototype  based  on  the  Aurora  flow  analyzing  system  (IBM  Research) 

Replaced  existing  Aggregation  Database  (ADB)  with  PostgreSQL  (PG)  backend 
Input  triage  according  to  zoom  monitors 

Incremental  population/gradually  remove  detailed  representation:  keep  “Session” 


HTTPs 


reconfiguration 
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Create  New  Zoom  Monitor 

[Screenshot: web form for creating a zoom monitor, with name and description fields and the following sections]

Filter definition: e.g., IPv4 Information / Destination Address; IPv4 Transport / TCP / Destination port 80 (existing filter templates: Destination address, Destination prefix, Empty template)
Export information: export template (existing templates: NetFlow 5, Empty template) with fields such as Source Address (key field), Protocol (key field), Section 340
Router/Interface: router, interface (FastEthernet 1/0), direction (input)
Lifespan: ad-hoc zoom monitor (start: now, duration: 30 sec) or specified start and end time
Cache: metering cache type (immediate), entries (8192, default), active timeout (30 min, default), inactive timeout (10 sec, default)
Collector: configured collector (UDP, port 2095) or create new collector

Buttons: Save as template, Create zoom monitor
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Zoom  Results:  Sessions 


[Screenshot: session table with columns Cli Bytes, Cli Pkts, Server IP, Server Port, Srv Bytes, Srv Pkts, Protocol, Conversations, Actions. Rows are timestamped between 2007-11-20 10:10:04 and 2007-11-20 11:36:09. Per-row actions: Show conversations, Flag session.]


Zoom  Results:  Conversations 
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Zoom  Results:  Zoom  Monitor  'Payload  Section' 
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Results:  Compression  (WAN  traffic) 


Number of records per bin
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Session  inactive  timeout:  20min 


Average  compression  ratio 


#flow records : #flows          1.26   (σ = 0.07)

#flow records : #conversations  2.34   (σ = 0.28)

#flow records : #sessions       22.80  (σ = 7.00)
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Traffic  Collection  for  Incident  Analysis 


"  After-the-fact  analysis 


Initial  guess 


Refine  assumptions 


Conclude 


Real-time  analysis 


Conclude 


Future  incident  trap 


Formulate 
incident  criteria 


Create  filtered 
data  collector 


Collect


Understand/ 
Infer  causes 


Conclude 
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Future  Work  and  Vision 

Automated  zoom  monitor  creation 

Interface  to  a  behavior-based  network  anomaly  detection  system 

Proactive  collection  of  evidence  for  off-line  forensic  analysis  of  abnormal  events 


Distributed  collector  infrastructure 

Distributed  collectors,  e.g.,  at  multiple  sites  (scalability) 

Transfer  required  information  to  central  reporting  system  on  demand 


"  Cisco  IOS  Flexible  NetFlow  with  input  filters 

Perform  filtering  on  routers  to  replace  software-based  metering  (and  filtering) 
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Conclusion 

Incident  analysis  tool  adapting  flow  information  granularity 

Increase  level  of  detail  of  relevant/unknown  traffic  events 
Decrease  level  of  detail  (lossy  compression)  of  less  relevant  events 
Keep  a  meaningful  abstraction  of  all  traffic  events 


Creation  of  customized  zoom  monitors 

Zoom  in  on  specific  traffic  to  gain  additional  information  about  its  properties  and  behavior 
Centralized  management  of  metering  devices  for  traffic  detail  collection 
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Outline 
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-  Entropy  measure 
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•  Case  Study:  Flows 
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Goal 


•  Problem: 

-  The goal is to transform original data records so that no sensitive
personal data are disclosed, while preserving the maximum amount
of relevant information (anonymity vs. utility trade-off), data integrity and
consistency.

•  Application 

-  Creating  datasets  for  application  testing,  whenever  production  DB 
contains  sensitive  data.  (Our  original  goal) 

-  Allowing researchers to share data and run analytical models on micro-data
(e.g., log files), preserving privacy.
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Goal 


Risk-Utility  Confidentiality  Map 


Anonymisation  engine  &  risk  estimation 


Input  data 

1 


Policy 


Utility  Assessment 


Output 


Copyright  ©  2008  Accenture  All  Rights  Reserved. 


Implementation 


•  Using  FLAIM  (Framework 
for  Log  Anonymization  and 
Information  Management), 
developed  by  NCSA 

•  FLAIM  anonymization 
engine  (adapted)  +  risk 
module 
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Anonymisation  primitives 

IPs 

•  Black  Marker  (16  bits): 

•  Random  Permutation  (one-to-one  mapping) 

•  Prefix-preserving  (random  permutation,  but  preserving  structure) 


IP Address        Black Marker (16-bit)   Random Permutation   Prefix-preserving
168.125.96.167    168.125.0.0             124.12.132.37        12.131.102.67
168.125.96.18     168.125.0.0             231.45.36.167        12.131.102.17
168.125.132.37    168.125.0.0             12.72.8.5            12.131.201.29

Port  number 

•  Bilateral Classification: replace the port with 0 if it is smaller than 1024,
or with 65535 otherwise. E.g., 27 -> 0, 2048 -> 65535

Number  of  packets/bytes 

•  Add  random  noise  (zero-average) 

•  Classification 
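
A minimal Python sketch of some of these primitives follows. It is not FLAIM code: the function and class names are invented for illustration, treating port 1024 itself as "large" is an assumption, and prefix-preserving anonymisation is omitted because it needs a keyed construction beyond a short example.

```python
import ipaddress
import random


def black_marker(ip: str, bits: int = 16) -> str:
    """Overwrite the lowest `bits` bits of an IPv4 address with zeros."""
    value = int(ipaddress.IPv4Address(ip))
    mask = (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value & mask))


def bilateral_port(port: int) -> int:
    """Bilateral classification: 0 below 1024, 65535 otherwise (assumed boundary)."""
    return 0 if port < 1024 else 65535


class RandomPermutation:
    """One-to-one IP mapping, built lazily and consistent within one run."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._forward = {}
        self._used = set()

    def map(self, ip: str) -> str:
        if ip not in self._forward:
            # Draw random addresses until an unused one is found,
            # so two inputs never share an output.
            while True:
                candidate = str(ipaddress.IPv4Address(self._rng.getrandbits(32)))
                if candidate not in self._used:
                    break
            self._used.add(candidate)
            self._forward[ip] = candidate
        return self._forward[ip]


def add_noise(count: int, spread: int, rng: random.Random) -> int:
    """Add zero-average integer noise; clamping at 0 slightly biases small counts."""
    return max(0, count + rng.randint(-spread, spread))
```

For example, `black_marker("168.125.96.167")` gives `168.125.0.0`, matching the first row of the table on this slide.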
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Attack  scenario 

•  The  attacker  aims  at  re-identifying  released  data  by  linking  them  with  some  background 
knowledge,  which  has  some  overlapping  attributes  with  the  released  dataset. 

•  Estimating P(r|s): knowing the data masking transformations, using distance-based similarity

•  The more uncertain the mapping, the lower the risk

•  Because  the  data  holder  does  not  know  in  advance  which  records  and  attributes  might  be 
available  to  the  attacker,  it  must  run  the  risk  analysis  on  the  whole  released  dataset  and 
assume  a  set  of  key  attributes  the  attacker  might  know  and  use  for  re-identification. 


Original data S

SrcIP            SrcPort   DestIP           DestPort   Packets
168.125.253.2    80        147.81.124.1     3157       40
39.109.219.43    7310      142.68.22.108    59959      126
35.187.130.82    161       213.48.19.68     22         83

Anonymised data R

SrcIP            SrcPort   DestIP           DestPort   Packets
168.125.253.0    1023      10.1.1.1         65535      42
39.109.219.0     65535     10.1.1.1         65535      132
35.187.130.0     1023      10.1.1.1         0          81
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Estimating  risk 


[Figure: three example match distributions, ordered from high risk to low risk of re-identification: H = 1.2, H = 3.7, H = 4.9]

Shannon  entropy:  Average  #  of  binary  questions  to  identify  s 
Small:  risky  Large:  safe 
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Entropy  as  a  risk  measure 


Shannon entropy: average number of binary questions needed to identify a
single s

$$H(\mathcal{R}|s) = -\sum_{r \in \mathcal{R}} p(r|s) \log_2 p(r|s)$$

Global risk:

Expected number of correct matches

$$\mathrm{ecm} = \sum_{s \in \mathcal{S}} \frac{1}{2^{H(\mathcal{R}|s)}}$$
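
As a worked illustration, the sketch below computes the per-record entropy of a match distribution p(r|s) and sums 2^(-H) over all original records to obtain the expected number of correct matches. The function names and the toy distributions are assumptions made for this example only.

```python
import math


def conditional_entropy(probs):
    """Shannon entropy in bits of the match distribution p(r|s) for one record s."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_correct_matches(match_distributions):
    """Global risk: sum over all original records s of 1 / 2**H(R|s)."""
    return sum(2 ** -conditional_entropy(p) for p in match_distributions)


# Toy example: three original records with different uncertainty.
distributions = [
    [1.0],                     # uniquely re-identifiable, H = 0
    [0.25, 0.25, 0.25, 0.25],  # four equally likely candidates, H = 2
    [0.5, 0.5],                # two equally likely candidates, H = 1
]
ecm = expected_correct_matches(distributions)
```

Dividing `ecm` by the number of original records gives the risk as a fraction of expected correct matches.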
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Some  properties 

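•  Directly linked to information loss (utility):

$$I(\mathcal{S};\mathcal{R}) = H(\mathcal{R}) - H(\mathcal{R}|\mathcal{S})$$

-  Minimal info loss: with constraint $H(\mathcal{R}|s) \geq c$

•  Additivity

$$H(\mathcal{R}_1, \mathcal{R}_2 | s) = H(\mathcal{R}_1 | s) + H(\mathcal{R}_2 | \mathcal{R}_1, s)$$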
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Case  study:  flow 


•  nfdump  testing  dataset  provided  by  FLAIM  group 

•  10000 records

•  Src/Dst  IPs,  Src/Dst  ports,  Bytes  used 


Date flow start          Duration      Proto  Src IP Addr:Port          Dst IP Addr:Port        Packets  Bytes  Flows
2007-09-15 21:09:11.401  0.392         TCP    93.82.215.84:36073     -> 215.177.13.213:80       8        672    1
2007-09-15 21:09:12.491  0.000         UDP    57.28.244.23:48549     -> 204.23.1.67:33467       1        40     1
2007-09-15 21:09:12.431  0.000         UDP    57.28.244.23:48549     -> 204.23.1.67:33465       1        40     1
2007-09-15 21:09:12.356  0.354         TCP    89.240.246.94:60717    -> 190.0.95.202:3128       7        1253   1
2007-09-15 21:09:12.127  0.000         UDP    154.159.232.119:56395  -> 204.23.1.67:33524       1        40     1
2007-09-15 21:09:11.617  0.000         UDP    72.252.1.23:53         -> 191.69.116.86:4489      1        165    1
2007-09-15 21:19:20.043  4294216.796   UDP    151.117.100.51:111     -> 106.243.186.60:967      5        280    1
2007-09-15 21:19:21.348  1430.067      UDP    111.96.210.161:61718   -> 70.114.202.209:161      2        154    1
2007-09-15 21:19:22.694  0.000         UDP    169.53.207.33:53       -> 247.215.39.74:3337      1        329    1
2007-09-15 21:19:20.074  0.000         TCP    141.245.94.187:39414   -> 217.242.169.109:479     1        60     1
2007-09-15 21:19:21.323  4293905.249   UDP    111.96.210.161:51937   -> 80.187.116.29:161       ?        154    1
2007-09-15 21:19:21.314  1388.111      UDP    111.96.210.161:53427   -> 80.187.116.29:161       3        231    1
2007-09-15 21:19:19.139  0.000         UDP    169.53.207.33:53       -> 99.74.24.233:51878      1        284    1
2007-09-15 21:19:19.321  0.000         UDP    169.53.207.33:53       -> 99.74.24.233:51879      1        284    1
2007-09-15 21:19:21.321  0.000         UDP    111.96.210.161:53877   -> 80.187.116.29:161       1        77     1
2007-09-15 21:19:26.305  4294392.436   UDP    169.53.207.33:53       -> 98.14.24.3:509?9        2        348    1
2007-09-15 21:19:15.297  69.143        TCP    121.191.230.139:25     -> 135.219.55.50:1674      4        291    1
2007-09-15 21:19:21.375  5.023         TCP    103.6.42.145:20144     -> 88.118.84.209:51024     552      28712  1
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Risk  as  the  percentage  of  expected  correct  matches 


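[Bar chart: risk as percentage of expected correct matches (y-axis, 0 to 100) for five anonymisation configurations:
1. BM (16) IPs, classified ports
2. BM (16) IPs, classified ports, noise on bytes
3. BM (24) IPs, classified ports, noise on bytes
4. Random permutation of IPs, classified ports, noise on bytes
5. BM (32) IPs, classified ports, noise on bytes]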


Final  remarks 


Quantifying  disclosure  risk  is  essential  for  finding  the  optimal  trade-off 
between  privacy  and  utility. 

Measure  disclosure  risk  using  entropy: 

-  General: applicable to any anonymization algorithm (unlike k-anonymity)

-  Stable:  depends  on  shape  of  the  distribution 

-  Linked  to  Information  Theory 

Future work (a lot...):

•  More  realistic  testing  (larger  dataset,  correlation  across  fields/records) 

•  Utility,  Optimisation,  ... 
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Thanks  for  the  attention 
Michele.bezzi@accenture.com
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Simplifying  the 
configuration  of  flow 
monitoring  probes 


Xenofontas (Fontas) Dimitropoulos (xed@zurich.ibm.com)
Andreas Kind (ank@zurich.ibm.com)


Zurich  Research  Laboratory 


Outline 

■  Background  and  motivation. 

■  Probe  configuration  architecture: 

Requirements  and  goals. 
Design. 

Implementation. 

■  Future  work  and  conclusions. 
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Network  configuration 


■  Network  elements  are  typically  configured  with  low-level  commands,  e.g., 
Cisco  IOS  commands. 

■  Network  administrators 
manage  numerous  network 
elements  with  lengthy 
configuration  files. 

■  Network  configuration  is 
an  error-prone  and 
time-consuming 
process. 

■  Configuration  errors  can 
be  costly,  e.g.: 

network  outages 
violations  of  SLAs 
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Probe  configuration 


■  The  configuration  of  monitoring  probes  is  part  of  the  more  general  network 
configuration  problem. 

■  Monitoring  probes  are  gradually  becoming  more  intelligent,  for  example, 
using  advanced  sampling  and  data  aggregation  techniques.  Consequently, 
their  configuration  becomes  more  involved. 

■  Flexible NetFlow (FNF) and IPFIX provide numerous configuration options
that  were  not  available  earlier: 

FNF  has  58  different  configuration  commands. 

FNF  provides  65  different  fields,  arbitrary  combinations  of  which  can  be  used  in 
the  definition  of  flow  key  and  non-key  fields. 

■  Certain  network  operation  applications  need  to  dynamically  change 
configuration  to: 

adapt  to  changing  traffic  conditions, 
investigate  on-going  network  anomalies. 
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Configuration  requirements 


[Diagram: network operation applications (traffic profiling, billing, anomaly detection, application identification, traffic engineering) on top of the network monitoring probes]
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Configuration  requirements 


[Diagram: network operation applications (traffic profiling, billing, anomaly detection, application identification, traffic engineering) express application needs; probe configuration translates them into low-level configuration for the network monitoring probes]


■  Probe configuration should:

1  take into account application needs,

2  be aware of the available monitoring probes,

3  generate low-level configuration commands,

4  configure or update the configuration of probes.
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Probe  configuration  architecture 


Three  modules: 

the  measurements  module 
describes  different 
measurements,  i.e.,  application 
needs. 

the  inventory  module  describes 
the  monitoring  probes  of  a 
network. 

the  back-end  module  provides 
necessary  information  for 
generating  low-level  commands. 

The  specification  identifies 
application  needs. 

The configurator:

uses the modules and
specification to generate low-level
commands.

configures the monitoring probes.
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1. 

Goals for simplifying configuration

1.  Abstraction: hide low-level configuration commands.

2.  Objective-oriented configuration expression:
express configuration in terms of measurement objectives,
focus on measurements instead of devices.

3.  Network-wide configuration: configure a network instead of
configuring individual devices.

4.  Re-usability: make parts of configuration network-independent.

5.  Extensibility: easily introduce support for new commands,
measurements, etc.
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Configuration  abstraction  hierarchy 


■  1st level: vendor-specific configuration
commands. 

■  2nd  level:  probe  elements  (pe),  i.e., 
logical  components  of  a  probe,  like 
interface,  flow  cache,  exporter. 

■  3rd  level:  configlet,  i.e.,  a  set  of  specific 
probe  elements  that  realizes  a 
measurement. 

■  4th  level:  measurement  services,  i.e.,  a 
configlet  with  certain  probe  selection 
rules. 
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Back-end  module 


■  Specifies  different  probe 
elements. 

■  A  probe  element  specification: 

is  written  in  XML. 

has  a  unique  id. 

identifies  parameters 
and  parameter  default 
values. 

determines  the  low-level 
vendor-specific  commands. 


<!-- Probe Element Exporter -->

<pe id='generic_exporter'>

  <params>
    <param id='port'>90</param>
    <param id='transport'>udp</param>
    <param id='destination'>192.0.0.1</param>
    <param id='label'>EXPORTER</param>
  </params>

  <template>

    <ios>
      flow exporter $label
      destination $destination
      transport $transport $port
    </ios>

    <yaf>
      --out $destination --ipfix $transport --ipfix-port $port
    </yaf>

    <junos>
    </junos>

  </template>

</pe>
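
The step from a probe element to low-level commands can be pictured as plain parameter substitution. The Python sketch below is a guess at that mechanism, not the authors' configurator: it copies the default values and the two non-empty templates from the probe element above and fills them with `string.Template`; the `render` helper and the dictionaries are hypothetical.

```python
from string import Template

# Default parameter values of the 'generic_exporter' probe element.
DEFAULTS = {
    "port": "90",
    "transport": "udp",
    "destination": "192.0.0.1",
    "label": "EXPORTER",
}

# Vendor-specific command templates, keyed by operating system.
TEMPLATES = {
    "ios": "flow exporter $label\n"
           "destination $destination\n"
           "transport $transport $port",
    "yaf": "--out $destination --ipfix $transport --ipfix-port $port",
}


def render(os_name, overrides=None):
    """Fill the template for one OS; overrides replace default values."""
    params = {**DEFAULTS, **(overrides or {})}
    return Template(TEMPLATES[os_name]).substitute(params)


ios_commands = render("ios", {"label": "TM_EXPORTER", "port": "2055"})
yaf_arguments = render("yaf", {"port": "2055"})
```

`substitute` raises `KeyError` when a template references a parameter that has no value, which gives a simple form of error checking.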
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■  III 
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Inventory  module 


■  Specifies  network  probes,  i.e.,  lists 
the  characteristics  that  can  be 
useful  for  their  configuration. 


■  Besides  describing  location, 
system,  and  interface  information, 
it  declares  tags  that  can  be  used 
for  grouping  probes  and  for  probe 
selection. 


<probe id='trabant.zurich.ibm.com'>
  <address>9.4.68.154</address>

  <location>
    <city>Zurich</city>
    <state>Central CH</state>
    <country>Switzerland</country>
  </location>

  <system>
    <os>ios</os>
    <version>12.4</version>
  </system>

  <interface id='FastEthernet0/0'>
    <capacity>100Mbits</capacity>
    <tag>internal</tag>
  </interface>

  <interface id='FastEthernet0/1'>
    <capacity>100Mbits</capacity>
    <tag>customer</tag>
  </interface>

  <tags>
    <tag>edge</tag>
  </tags>

</probe>
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Measurements 

module 


<!-- Monitor how much traffic is sent -->

<!-- between IP blocks. -->

<msr id='traffic_matrix'>

  <params> <!-- Default parameter values -->
    <param id='collector_address'>localhost</param>
    <param id='collector_port'>2055</param>
    <param id='collector_transport'>tcp</param>
  </params>

  <!-- Probe element chain -->

  <configlet>

  </configlet>

  <rules>

  </rules>

</msr>
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Measurements 

module 


<!-- Probe element chain -->
<configlet>

  <pe>
    <name>exporter</name>
    <params>
      <param id='label'>TM_EXPORTER</param>
      <param id='destination'>$collector_address</param>
      <param id='port'>$collector_port</param>
      <param id='transport'>$collector_transport</param>
    </params>
  </pe>

  <pe>
    <name>flow_cache</name>
    <params>
      <param id='label'>TM_CACHE</param>
      <param id='record'>SRC_DST_PREFIX_REC</param>
      <param id='export'>TM_EXPORTER</param>
    </params>
  </pe>

  <pe>
    <name>interface</name>
    <params>
      <param id='monitor'>TM_CACHE</param>
      <param id='interface'>$interface->id</param>
      <param id='direction'>output</param>
    </params>
  </pe>

</configlet>




X  Dimitropoulos  |  Systems  Department  |  IBM  Research 


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Measurements 

module 


<rules> 

<interface> 

if ( $interface.tag eq "external" and
     $probe.tag eq "edge" ) {
  return 1;
} else {
  return 0;
}

</interface> 

</rules> 
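
To make the selection rule concrete, the Python sketch below applies the same condition (interface tagged "external" on a probe tagged "edge") to a small inventory. The `<inventory>` wrapper element, the second probe, and the "external" tag on FastEthernet0/1 are assumptions made so the example has something to select; the rule language of the actual system is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory; element names follow the inventory module slide.
INVENTORY = """
<inventory>
  <probe id='trabant.zurich.ibm.com'>
    <interface id='FastEthernet0/0'><tag>internal</tag></interface>
    <interface id='FastEthernet0/1'><tag>external</tag></interface>
    <tags><tag>edge</tag></tags>
  </probe>
  <probe id='core1.example'>
    <interface id='FastEthernet0/0'><tag>external</tag></interface>
    <tags><tag>core</tag></tags>
  </probe>
</inventory>
"""


def select_interfaces(inventory_xml, interface_tag, probe_tag):
    """Return (probe id, interface id) pairs that satisfy both tag conditions."""
    root = ET.fromstring(inventory_xml)
    selected = []
    for probe in root.findall("probe"):
        probe_tags = {t.text for t in probe.findall("tags/tag")}
        if probe_tag not in probe_tags:
            continue  # rule fails at probe level
        for iface in probe.findall("interface"):
            iface_tags = {t.text for t in iface.findall("tag")}
            if interface_tag in iface_tags:
                selected.append((probe.get("id"), iface.get("id")))
    return selected


matches = select_interfaces(INVENTORY, "external", "edge")
```

Each selected pair would then receive the measurement's configlet, with `$interface->id` bound to the interface identifier.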
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Input  specification 


■  Lists  the  measurements  and  the 
probes  in  which  to  enable  these 
measurements. 


■  Is  the  user  interface  and  can  be 
generated  through  a  GUI. 
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<!--  Probes  to  apply  measurements  on  --> 
<probe id='wassen.zurich.ibm.com'></probe>
<probe id='trabant.zurich.ibm.com'></probe>


<!-- Measurements -->

<msr id='traffic_matrix'>

  <params> <!-- overwrite default values -->
    <param id='collector_address'>9.4.68.204</param>
    <param id='collector_port'>2055</param>
    <param id='collector_transport'>udp</param>
  </params>

</msr>

<msr id='app_monitoring'>

  <params> <!-- overwrite default values -->
    <param id='collector_address'>9.4.68.205</param>
    <param id='collector_port'>2055</param>
    <param id='collector_transport'>udp</param>
  </params>

</msr>


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Conclusions 


■  Described  an  architecture  for  automating  the  configuration 
of  flow  monitoring  probes. 

Configuration  abstraction. 

Reuse  configuration. 

Extensibility. 

■  Future/on-going  work: 

Incorporate  error-checking  techniques. 

Develop  libraries  for  typical  measurements. 

Configuration  optimization. 

Use  NetConf. 
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High  Level  Flow  Correlation 


Valentino  Crespi,  California  State  Los  Angeles,  CA 
Annarita  Giani,  UC  Berkeley,  CA 
Rajiv  Raghunarayan,  Cisco  Systems,  Inc. 
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Outline 


1.  Extension of previous work on Flow Aggregation
(Flocon 2006).

2.  Embedding of network traffic in a Euclidean space.

3.  Complex modeling through clustering.

4.  Planned work.
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Outline 


1.  Extension of previous work on Flow Aggregation
(Flocon 2006).

2.  Embedding of network traffic in a Euclidean space.

3.  Complex modeling through clustering.

4.  Planned work.
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Behind  Flow  Aggregation 


[Diagram: data reduction pyramid, from bottom to top]

BYTES: millions per hour
PACKETS: hundreds of thousands per hour
FLOWS: thousands per hour
Flow Aggregates

How data move


•  Monitoring 

•  Anomaly  detection 

•  Security  analysis 

•  Traffic  profiling 

•  Debugging 

•  Traffic  engineering 

•  Usage-based  profiling 

•  Network  planning 

•  Pricing,  peering 


Data  Reduction  =  Fewer  events  to  be  analyzed 
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Our  Previous  Work 

A. Giani, I. De Souza, V. Berk, G. Cybenko, "Attribution and Aggregation of Network Flows for
Security Analysis," in Proc. Flocon 2006, Portland, OR.


We believe that automated
correlation at the raw flow level
is complicated and susceptible to
false positives. The world consists
of processes, so our approach to
correlation is process-based.

Flow  aggregation  and  correlations 
between  flow  data  with  security 
events 

Implementation  of  a  PQS  based 
process  detection  for  Cyber 
Situational  Awareness. 


Flow  +  Snort  Alerts 

Scenario:  several  packets  in  a  flow  triggered  IDS  alerts 


Snort  rule  1560 
generates  an  alert 
when  an  attempt 
is  made  to  exploit  a 
known  vulnerability 
in  a  web  server  or  a 
web  application. 


Snort  rule  1852 
generates  an  alert 
when  an  attempt  is 
made  to  access 
the  ’robots.txt'  file 
directly. 
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Tabic  2:  A  sample  track  of  correlated  IDS  and  Flow 
events 


The  flow  can  be  characterized  as  malicious  and  further  investigation  must  be  done. 
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Outline 


1.  Extension  of  previous  work  on  Flow  Aggregation, 
(Flocon  2006). 

2.  Embedding of network traffic in a Euclidean space.

3.  Complex  modeling  through  clustering. 

4.  Planned  work. 
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Current  aggregators  and  analyzers 


•  POWERFUL TOOLS to understand the behavior of the network according
to certain parameters, e.g. the amount of resources consumed or the
variance of the various characteristics of the communication (source IP,
destination IP, port).

•  PROBLEM:  They  do  not  provide  an  analysis  and  a  description  of  the 
dynamic  evolution  of  network  traffic. 

•  NEED  for  a  structure  that  summarizes  the  behavior  of  the  network. 

OUR  IDEA 

Combine  flow  aggregation  techniques  with  our  previous  process-based 
approach: 

Use  aggregators  and  flow  analyzers  to  translate  traffic  into  a  process  to 
be  modeled  and  estimated. 
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Build  circuits  of  Aggregating  gates 


1.  Place observing nodes in multiple locations of the network (e.g. on each
local router).

2.  Each observing node dumps traffic flows to a Macro Aggregator (MA).

3.  The Macro Aggregator is a circuit in which each gate is a flow aggregator.

■  The first layer consists of classical aggregators that output flow
aggregates. Successive layers process aggregates of flow
aggregates.

■  Final output: a vector function of the dumped traffic ranging in $\mathbb{R}^n$:

$$\mathbf{x}(t) = (x_1(t), x_2(t), \dots, x_n(t))$$

At each time the observing nodes produce a set of vectors:

$$S(t) = (\mathbf{x}_1(t), \mathbf{x}_2(t), \dots, \mathbf{x}_n(t))$$

4.  Identify and analyze properties of S(t) over time to characterize/detect
anomalies.
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Embed  Traffic  in  Euclidean  Space 


(Entropy  S-IP, Entropy  D-IP,  Average  Size,...,%TCP  Traffic, %UDP  Traffic) 
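
One way to read this embedding is as a function from a batch of flows to a point in Euclidean space. The Python sketch below computes five of the listed coordinates for a non-empty batch; the dictionary-based flow representation and the choice to measure protocol shares in bytes are assumptions for illustration, not the authors' aggregator circuit.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def embed(flows):
    """Map a non-empty list of flows to a feature vector.

    Each flow is a dict with keys 'src', 'dst', 'proto', 'bytes'
    (a hypothetical layout chosen for this sketch).
    """
    total_bytes = sum(f["bytes"] for f in flows)

    def share(proto):
        # Percentage of bytes carried by the given protocol.
        return 100.0 * sum(f["bytes"] for f in flows if f["proto"] == proto) / total_bytes

    return (
        shannon_entropy([f["src"] for f in flows]),  # Entropy S-IP
        shannon_entropy([f["dst"] for f in flows]),  # Entropy D-IP
        total_bytes / len(flows),                    # Average Size
        share("TCP"),                                # %TCP Traffic
        share("UDP"),                                # %UDP Traffic
    )


flows = [
    {"src": "10.0.0.1", "dst": "10.0.9.9", "proto": "TCP", "bytes": 100},
    {"src": "10.0.0.2", "dst": "10.0.9.9", "proto": "TCP", "bytes": 100},
    {"src": "10.0.0.3", "dst": "10.0.9.9", "proto": "TCP", "bytes": 100},
    {"src": "10.0.0.4", "dst": "10.0.9.9", "proto": "UDP", "bytes": 100},
]
point = embed(flows)
```

In this made-up batch, four distinct sources hitting one destination give a high source-IP entropy and a zero destination-IP entropy, the kind of pattern the clustering step would later pick up.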
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Entropy  Based  Flow  Aggregation  (2006) 

Yan  Hu,  Dah-Ming  Chiu,  and  John  C.S.  Lui 
The  Chinese  University  of  Hong  Kong 


Based  on  Cisco’s  NetFlow  -  during  flooding  attacks  the  memory  and  network 
bandwidth  consumed  by  flow  records  can  increase  beyond  what  is  available. 

A  solution:  Adapting  sampling  rate. 

Flows  of  security  attacks  usually  have  common  patterns  and  form  conspicuous  traffic 
clusters. 

Identifies clusters of attack flows in real time and aggregates the large number of
short attack flows into a few meta flows.

Same source IP ~ worm propagation
Same destination IP ~ denial of service attack
Same destination IP and source IP ~ most port scans

Purpose  is  mostly  security. 
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On  the  correlation  of  Internet  flow 

characteristics  (2003) 

Kun-Chan Lan, John Heidemann
Information Sciences Institute, University of Southern California


A  small  percentage  of  flows  consume  most  of  the  network  bandwidth. 

Study  of  heavy  flows  in  4  orthogonal  dimensions: 

•  Size 

•  Duration 

•  Rate 

•  Burstiness 

and  examine  their  correlations. 

Strong  correlation  between  size,  rate,  burstiness 
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Automatically  Inferring  Patterns  of  Resource 
Consumption  in  Network  Traffic  (2003) 

Cristian  Estan,  Stefan  Savage,  George  Varghese 
University  of  California,  San  Diego 


Method  of  traffic  characterization  that  automatically  groups  traffic  into 
minimal  clusters  of  conspicuous  consumption. 

It is not a static analysis that captures flow characteristics but instead
produces hybrid traffic definitions that match the underlying usage.


Purpose  is  mostly  resource  consumption. 
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Analyze  S(t)  over  time 

Approaches: 

1. Use clustering techniques (e.g., spectral clustering, k-means based
algorithms, etc.) to cluster the observing nodes and infer correlations
between observations and snapshots across the network.

    1. Study how clusters change over time and characterize/detect
    anomalies.

    2. Use clusters to produce a graphic representation of the traffic.

    3. Define discrete models to describe the evolution of clusters in relation
    to specific events: coordinated computer attacks, presence of covert
    channels, bugs in the network software, hardware breakdowns, etc.

2. Define State Space models.

3. Apply learning techniques to learn models.
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Spectral  Clustering 


Input: Similarity matrix $M = [a_{ij}]$, number $k > 0$

$a_{ij} = s(X_i, X_j)$, e.g. $a_{ij} = \exp\left(-\|X_i - X_j\|^2 / (2\sigma^2)\right)$


•  Build  similarity  graph.  For  example  the  Graph 
whose  adjacency  matrix  AG  =  M. 

•  L  =  Laplacian(  AG  ) 

•  Compute the k eigenvectors of L associated
with the k smallest eigenvalues: v1, v2, ..., vk

•  V = [v1 v2 ... vk], n x k matrix

•  Pick the rows of V: y1, y2, ..., yn

•  Cluster the yi's using the k-means algorithm into
C1, C2, ..., Ck

Output: clusters C1, C2, ..., Ck
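
The steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the unnormalised Laplacian L = D - A (the slide does not say which Laplacian is meant), a Gaussian similarity with a hypothetical `sigma` parameter, and a small k-means with a deterministic farthest-point initialisation.

```python
import numpy as np


def gaussian_similarity(X, sigma=1.0):
    """Similarity matrix a_ij = exp(-||X_i - X_j||^2 / (2 sigma^2))."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def kmeans(Y, k, iters=50):
    """Plain k-means on the rows of Y with farthest-point initialisation."""
    centers = [Y[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = np.sum((Y[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dist, axis=1)
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clustering(M, k):
    """Cluster n items given an n x n symmetric similarity matrix M."""
    A = np.asarray(M, dtype=float)          # adjacency matrix A_G = M
    L = np.diag(A.sum(axis=1)) - A          # unnormalised Laplacian (assumption)
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    V = eigvecs[:, :k]                      # k eigenvectors, n x k
    return kmeans(V, k)                     # cluster the rows y_1..y_n
```

Usage: `labels = spectral_clustering(gaussian_similarity(X), k)` where each row of `X` is one embedded traffic vector.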
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Discrete  Models  of  Cluster 

Evolution 


Idea:  Build  DFA  models  to  identify  transitions.  In 
this  case  we  identify  anomalies  by  studying 
the  current  clustering  in  relation  to  the 
previous  “snapshot”  of  traffic 


FloCon  2008,  Savannah  GA  ,  Jan  7-10,  2008 


Challenges 


Parameter  estimation:  in  our  example  of  clustering  k 
was  fixed. 

Apply  Bayesian  learning  techniques  to  infer  k. 

Apply  mixture  models  technique  to  clustering 

Define  and  learn  models  of  the  system’s  dynamics. 

Identify  relevant  attributes  of  flow  aggregators  to 
obtain  significant  vectors. 

Define  appropriate  similarity  function. 
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Outline 


1. Extension of previous work on Flow Aggregation
(FloCon 2006).

2. Embedding of network traffic in a Euclidean space.

3.  Complex  modeling  through  clustering. 

4.  Planned  work. 
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Planned  Work 


Implement  clustering  method. 

Develop  discrete  models. 

Build a software monitor to analyze traffic
through clusters and vector representation.

Experimental analysis of the effectiveness
of our approach.
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Talk  outline 


The  Problem 

•  Noise  in  traffic  flows 

•  Impact  on  anomaly  detection 

Two  Stage  Filtering 

•  Log  Filtering 

•  State  Filtering 

Attack  reduction 

•  Core  assumptions 

•  Method  for  data  removal 

•  Impact 
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Numberof  Flow  Records 


Innocuous  Attacks 
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Size  (nodes) 


Normal  SSH  Activity 


[Plot: time axis from 04:00 to 05:00; legend: Anomalous, Normal]
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Raw  SSH  Data 
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A  Hypothesis 

We  see  two  populations: 

•  Normal  users,  who  know  where  they’re  going 

•  Attackers,  primarily  scanners,  who  have  no  idea  about 
the  network’s  structure 

The  majority  of  attackers  are  clumsy 

•  Low  success  rates 

•  Picking  targets  effectively  at  random 

•  Pick  many  more  targets  than  there  are  actual  targets 

- >350,000 targets per 30 s period, vs. ~10,000 real targets
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Frequency  erf  occurrence 


Comparing  the  two  populations 


[Plot: frequency of occurrence vs. largest component size]
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Impact  on  anomaly  detection 


Almost  every  anomaly  detection  system  requires 
advance  knowledge 

•  Mean,  standard  deviations 

•  Map  of  known  servers 

This  information  may  not  be  easily  acquired 

•  Inventory  is  nontrivial 

•  Going  by  the  data  can  lead  to  false  positives  from 
attackers 

We  need  to  train  the  system  while  acknowledging  the 
hostility 
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Filtering:  Log  Filtering 

At least with TCP data, we can rely on the state
machine

•  < 3 packets implies it is most likely a scan

•  > 3 packets may be legitimate

In a two-week SSH dataset:

•  < 3 packets make up 87% of the flows

•  < 3 packets make up 1% of total bandwidth
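
A minimal sketch of this first filtering stage, assuming flow records are dicts with hypothetical `packets` and `bytes` keys. The slide does not say how flows of exactly 3 packets are treated; keeping them is an assumption made here.

```python
def log_filter(flows, min_packets=3):
    """Split flows into (kept, dropped) by packet count.

    Flows with fewer than `min_packets` packets cannot have completed a
    TCP handshake and exchanged data, so they are treated as likely scans.
    """
    kept = [f for f in flows if f["packets"] >= min_packets]
    dropped = [f for f in flows if f["packets"] < min_packets]
    return kept, dropped


def share(subset, flows, key=None):
    """Fraction of `flows` accounted for by `subset`.

    With key=None the share is by flow count, otherwise by the sum of
    the given field (e.g. "bytes").
    """
    if key is None:
        return len(subset) / len(flows) if flows else 0.0
    total = sum(f[key] for f in flows)
    return sum(f[key] for f in subset) / total if total else 0.0
```

Comparing `share(dropped, flows)` with `share(dropped, flows, "bytes")` reproduces the kind of flows-versus-bandwidth comparison quoted above.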
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Observed  hosts 


Log  Filtering  is  Insufficient 


12:00  12:15  12:30  12:45  13:00  13:15  13:30  13:45  14:00 

Time  (September  4,  2007,  GMT) 
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State  Filtering 


If  we  assume  activity  is  Gaussian,  then  we  can 
identify  and  eliminate  outliers 

Simple test: Shapiro-Wilk test for normality

•  Good for 25-2000 samples

•  Doesn’t  require  an  estimate  of  mean  or  standard 
deviation 
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Very  coarse 


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-50  0  50  100 


How  many  attacks  can  we  stand? 
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Conclusions 


Constant noise is manageable

•  But it requires integrating multiple filtering mechanisms

•  It also means assuming a certain mode of behavior

- This method assumes Gaussian behavior; other tests are available

Open  questions: 

•  What  do  we  do  with  scans  once  we  know  they’re  there? 
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Privacy,  Data  Protection  Law  and 
Flow  Data  Anonymisation: 
requirements,  issues,  and  challenges 


Elisa  Boschi,  Hitachi  Europe 

Ralph  Gramigna,  KPMG 


Acknowledgement:  M.  Bossardt  (KPMG),  D.  Battisti  (ETH) 


Outline 


Review  of  law  principles  and  requirements  on 
data  protection 

-  European  viewpoint 

-  What  is  personal  data? 

-  Why  is  data  protection  law  relevant  for  network 
monitoring? 

-  Law  principles  overview 

The  role  of  flow  data  anonymisation  to  support 
data  protection 

-  Discussion  on  its  applicability  and  weaknesses 

-  Suggestions  for  future  steps 


Data  Protection  Law:  EU  Directives 


■  Goal:  protect  the  privacy  of  individuals 

-  Not  limited  to  information  confidentiality 

■ EU Directives define the minimum legal
requirements to be implemented by each EU
member state

- Applicable to international data transfers with EU

■ Relevant to data protection:

- Directive 95/46/EC - on data protection

- Directive 2002/58/EC - on privacy and electronic
communications


Applicability  and  Personal  Data 


■ Directive 95/46/EC applies to the "processing of personal data"

- Personal data: "any information relating to an identified or identifiable
natural person ('data subject'); an identifiable person is
one who can be identified, directly or indirectly, in
particular by reference to an identification number or to
one or more factors specific to his ... identity."

- Processing: "any operation performed upon personal data, such as e.g.
collection, storage, adaptation or alteration, consultation,
disclosure by transmission, dissemination or otherwise making
available, alignment or combination, erasure or destruction"

■ Note: in some countries (e.g. Switzerland) this applies to
"legal entities" as well


Applicability  to  Network  Monitoring 

■  Indirect  identification  data  comprise  any 
information  that  may  lead  to  identification  of  the 
data  subject  through  association  with  other 
available  information 

-  information  available  to  the  entity  in  charge  of  the 
data  processing  (ISP), 

-  any  information  possessed  by  third  parties 

■  IP  addresses  can  identify  someone  "directly” 

-  Esp.  legal  entities 

■  Many  more  attributes  in  a  flow  record  can 
contribute  to  identifying  someone  "indirectly" 


Principles:  legitimation  for  processing 


1.  Consent 

2. Data processing is "necessary for the performance
of a contract to which the data subject is a party"

3. 


■  Processing  must  be  limited  to  specified  purposes 


■  Further  processing  of  data  for  historical,  statistical 
or  scientific  purposes  is  possible  provided  that 
appropriate  safeguards  are  provided 


Left  to  national  laws 


Principles: Information of the Subject


The  subject  must  be  informed  about: 

1. Identity of the data controller

2.  Purpose  of  the  processing 

3.  Other  information,  e.g.  the  recipient  of  the  data. 

■ It does not apply to scientific research, IF the
provision of such information

- proves impossible

- would involve a disproportionate effort

■ Appropriate safeguards must be provided
- Their specification is left to national law


Border  Crossing 


■  Transfer  to  third  countries  is  generally  possible  if 
the  third  country  ensures  an  adequate  level  of 
protection 

http://ec.europa.eu/justice_home/fsj/privacy/thridcountries/index_en.htm

■ E.g.

✓ Switzerland, Canada, Argentina
✗ USA (except Safe Harbor)


Traffic data and location data


■ Introduced in Directive 2002/58/EC

- Traffic data: any data processed for the purpose of the
conveyance  of  a  communication  or  for  the  billing  thereof 

-  Location  data:  data  indicating  the  geographic  position  of 
the  terminal  equipment  of  a  user 

■  Objectives: 

-  Minimise  the  processing  of  personal  data 

-  Use  anonymous  or  pseudonymous  data  where  possible. 

■ "Anonymous" = it is no longer possible to identify
the  data  subject 


Processing  of  Traffic  and  Location  Data 


■  Traffic  and  location  data  relating  to  subscribers  and 
users  must  be  erased  or  made  anonymous  when  no 
longer  needed 

■  The  processing  of  traffic  data  must  be  restricted 

-  To  persons  acting  under  authority  of  providers 

-  To  certain  activities  (e.g.  traffic  management,  fraud 
detection...) 


■  Location  data  can  be  processed  only  if 

-  There  is  consent,  or 

-  Data  is  made  anonymous 


The  Role  of  Flow  Data  Anonymisation  to 

Support  Data  Protection 


■  The  well  known  problem: 

-  The  more  you  anonymise  the  better  privacy  is  protected... 

-  ...but  the  less  useful  the  data 

■  Anonymisation  aims  at  removing  sensitive  information 
referring  to  an  individual 

■ Attacks on anonymisation schemes have shown that
those schemes can be broken, allowing people to be
identified "indirectly".

■  Are  known  flow  anonymisation  techniques  effective  in 
protecting  the  privacy  of  individuals? 


(4)  Anonymization  Techniques 


Field to be anonymized: IP address

IP               Truncation   Permutation    Black Marker   Prefix Preserving
135.98.111.17    135.98       141.2.32.37    10.1.1.1       22.131.88.67
135.98.111.128   135.98       41.12.96.67    10.1.1.1       22.131.88.157
135.98.132.37    135.98       142.72.8.5     10.1.1.1       22.131.201.29
141.161.3.3      141.161      21.33.4.1      10.1.1.1       12.192.32.51
141.72.8.5       141.72       11.14.96.118   10.1.1.1       12.78.201.97
32.53.48.1       32.53        12.161.3.3     10.1.1.1       31.197.3.82

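
The four techniques in the table can be illustrated with the sketch below. It is not the scheme used to produce the table: the prefix-preserving function follows the general bit-by-bit construction (each output bit is the input bit XORed with a keyed pseudo-random function of the preceding bits), with HMAC-SHA256 chosen here purely as an assumption, and the `Permutation` class and its seeded generator are hypothetical helpers.

```python
import hashlib
import hmac
import ipaddress
import random


def truncate(ip, keep_octets=2):
    """Truncation: keep only the leading octets of the address."""
    return ".".join(ip.split(".")[:keep_octets])


def black_marker(ip, marker="10.1.1.1"):
    """Black marker: replace every address with one constant value."""
    return marker


class Permutation:
    """Random one-to-one mapping, built lazily as addresses are seen."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._map = {}
        self._used = set()

    def __call__(self, ip):
        if ip not in self._map:
            candidate = self._rng.getrandbits(32)
            while candidate in self._used:      # keep the mapping injective
                candidate = self._rng.getrandbits(32)
            self._used.add(candidate)
            self._map[ip] = str(ipaddress.IPv4Address(candidate))
        return self._map[ip]


def prefix_preserving(ip, key):
    """Two inputs sharing a k-bit prefix map to outputs sharing a k-bit prefix."""
    addr = int(ipaddress.IPv4Address(ip))
    out = 0
    for i in range(32):
        prefix = addr >> (32 - i)               # the i most significant bits
        msg = i.to_bytes(1, "big") + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (addr >> (31 - i)) & 1
        out = (out << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(out))
```

Because the flip for bit i depends only on the bits before it, the structure of the address space survives anonymisation, which is exactly what the structure-based attacks discussed next exploit.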

Some  Anonymisation  Attack  Methods 


■ Data injection: injecting information to be logged with the
purpose of later recognizing that data in
the anonymized trace

■ Fingerprinting: matching attributes of an anonymized object
against those of a known object (e.g. web
server) to discover a mapping between them

■ Semantic attacks: the system is exploited in such a way that the
victim thinks he is doing one thing but is actually doing
something different. The attacker may infer
part of the unanonymized IP address by
exploiting the semantics of prefix preserving.

■ Structure recognition: recognizing structure between
anonymized and unanonymized objects


Attacks  vs.  Anonymisation  Techniques 


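[Table: rows are attacks (semantic attack, cryptographic attack, data injection, fingerprinting, structure recognition); columns are anonymisation techniques (prefix-preserving, cryptographic approach, truncation, permutation). Marked cells per row: semantic attack 2, cryptographic attack 2, data injection 3, fingerprinting 3, structure recognition 3.]

■ = the attack can be used, (partial) results achieved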


Conclusions 


■  We  need  to  pay  attention  to  data  protection  laws 


■  Anonymisation  is  part  of  the  solution  to  protecting 
privacy,  but 

-  Research  is  still  needed 

-  This  is  not  only  a  technical  problem;  a  technical  solution 
alone  is  not  enough 

■  Legal  solutions,  policies,  guidelines,  interdisciplinary 
work  are  needed 

■ Anonymisation support is needed in standard flow
data export protocols such as IPFIX


Automatic  anomaly  detection 
using  NfSen 

Wim  Biemolt,  SURFnet 


Werner  Schram,  SURFnet 


Automatic  anomaly 
detection  using  NfSen 


-  SURFnet  and  netflow  anomaly  detection 

-  NERD 

-  NfSen 

-  PeakFlow  SP 

-  Currently  used  detection  methods 

-  DDos 

-  Botnet 

-  Holt-Winters  aberrant  behavior 
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SURFnet  and  netflow 
anomaly  detection 


-  NERD v1

-  Developed  by  TNO 

-  Based  on  cflowd 

-  cflowd  is  no  longer  supported 

-  NERD  v2 

-  Initially  developed  by  TNO 

-  Has  serious  performance  problems 

-  NfSen  can  do  the  same  but  without  the 
performance  problems 
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IlftJ 

-  Netflow Sensor (NfSen) is a network statistics tool

-  Developed  by  Peter  Haag 

-  Currently  in  active  development 

-  Alert  plug-in  system 

-  Generic  plug-in  system 

-  Some  plug-ins  already  available 



NfSen 
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|  Home  1 1  Graphs  j  |  Details  |  j  Alerts  |  j  Stats  1 1  Plugins  |  live  Bookmark  URL  Profile:  live 


Overview  Profile:  live,  Group:  (nogroup) 
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DDos  detection 


-  Simple  flow  analysis 

-  based on NERD v1 DDos detection

-  using  a  low  threshold  and  a  high  threshold 

-  Rules  for  traffic  between  those  thresholds 

-  Custom  thresholds  for  high  load  services 
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DDos  interface:  report 


[Screenshot: DDos plug-in report tab (tabs: alarm events, report, setup, thresholds, botnets), with controls for the number of alarms to show and the date range]
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(0  for  all) 


The  ddos  alarms  between  2007-12-07  and  2007-12-15 


[Screenshot table. Column labels in the interface: ID, Destination, Flows per 5 minutes, Average packets/flow, Average bytes/flow, Starttime, Stoptime. First ID shown: #505981; destinations are not shown. The numeric columns below are listed in the order in which they appear.]

  7772   5054   4   2007-12-14 08:55:00   2007-12-14 16:32:50   1
 10620   3859   4   2007-12-14 08:39:54   2007-12-14 16:32:50   1
  9510   3147   3   2007-12-14 08:25:01   2007-12-14 16:32:50   1
 12951    129   2   2007-12-14 08:24:58   2007-12-14 16:32:50   1
  9517     73   1   2007-12-13 06:13:41   2007-12-14 16:32:50   1
281618    163   1   2007-12-04 14:47:47   2007-12-14 16:32:50   1
327975    125   1   2007-11-27 13:19:14   2007-12-14 16:32:50   1
 22047    171   2   2007-11-26 13:32:20   2007-12-14 16:32:50   1
  5222   2550   3   2007-12-14 16:20:07   2007-12-14 16:29:56   1
  6031   1155   7   2007-12-14 11:44:53   2007-12-14 16:22:51   1
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DDos  interface:  Details 


[Screenshot: DDos plug-in details tab for alarm 50598, 2007-12-14 08:55 to 2007-12-14 16:37. Flows graph: avg 7.66 k, max 9.81 k flows per 5 min. Bytes graph: max 72.73 M bytes per 5 min. Below the graphs, a list of the top 10 flows per 5 minutes at 2007-12-14 16:37:40 with port usage, each with "Report port scan" and "analyse" actions.]
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Botnet  detection 

-  Hosts  infected  by  viruses  connect  to  hosts  known  as 
botnet  controllers 

-  Lists of botnet controllers are available, for example:

http://www.bleedingthreats.net/rules/bleeding-botcc.rules

-  Our  plug-in  logs  all  hosts  that  connect  to  known  botnet 
controllers 

-  Automatically  reports  to  incident  report  system  using 
IODEF 
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Botnet  IODEF  reports 


[Figure: an IODEF-Document (XML, namespace urn:ietf:params:xml:ns:iodef-1.0) generated by NfSen, containing Incident, IncidentID, StartTime, EndTime, ReportTime, Assessment, Contact and EventData (Method, Flow, System, Node, Service) elements, overlaid with the incident details screen for incident SURFcert#019038 (status: open, type: infected, constituency: utwente.nl).]

Source (ip): 192.168.1.1
Target (ip:port): 192.168.1.2
Packet (type:count): flow:23
Start time: 2007-03-13T15:07:47+02:00
End time: 2007-03-13T21:06:12+02:00
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Holt-Winters  aberrant 
behavior  detection 


-  Uses  information  about  periodic  data  to  predict 
aberrant  behavior. 


[Figure: Flows/s, Fri Dec 7 16:35:00 2007 - Fri Dec 14 16:35:00 2007]
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X 


Holt-Winters:  Example 
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Holt-Winters: 

Original  implementation 


Trend  Periodic  information  Noise 


Prediction 
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/jfev  \  Li  mitations  of  the 

original implementation


-  The  original  algorithm  has  three  parameters  which 
define: 

-  the  weight  of  historical  data 

-  the  weight  of  the  trend 

-  the  amount  of  expected  noise 

-  The  original  algorithm  has  a  constant  learning  rate 

-  If  a  low  learning  rate  is  used,  the  selection  of  the  initial  values  is  critical. 
This  will  introduce  false  positives  for  a  long  time. 

-  With  a  high  learning  rate,  the  model  will  likely  be  overfitted.  This  will 
introduce  false  negatives 

-  The  trend  parameter  has  no  significant  influence  with 
the  resolution  we  are  using 
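
For reference, a minimal sketch of additive Holt-Winters forecasting with a deviation band, the general scheme this kind of aberrant behaviour detection builds on. It is not the NfSen plug-in's code: the parameter names (`alpha`, `beta`, `gamma`, `delta`), the initialisation from the first period, and the use of `gamma` for both the seasonal and the deviation update are assumptions.

```python
def holt_winters_aberrant(series, period, alpha=0.5, beta=0.1,
                          gamma=0.1, delta=3.0):
    """Return indices whose value falls outside the predicted band.

    alpha: weight of the newest observation in the level (baseline)
    beta:  weight of the newest slope in the trend
    gamma: weight of the newest observation in the seasonal/deviation terms
    delta: band width, in multiples of the smoothed absolute deviation
    """
    if len(series) <= period:
        raise ValueError("need more than one full period of data")
    first = series[:period]
    level = sum(first) / period
    trend = 0.0
    season = [y - level for y in first]
    # Crude initial noise estimate (assumption): mean seasonal amplitude.
    dev = [sum(abs(s) for s in season) / period] * period
    flagged = []
    for t in range(period, len(series)):
        i = t % period
        y = series[t]
        predicted = level + trend + season[i]
        error = y - predicted
        if abs(error) > delta * dev[i]:
            flagged.append(t)
        new_level = alpha * (y - season[i]) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[i] = gamma * (y - new_level) + (1 - gamma) * season[i]
        dev[i] = gamma * abs(error) + (1 - gamma) * dev[i]
        level = new_level
    return flagged
```

With constant `alpha` and `gamma` the sketch shows the trade-off described above: small values make the model depend heavily on the initial period, large values let it absorb an attack into the baseline within a few periods.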
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Holt- Winters: 
Multiple  trends 


Network  traffic  time  series  often  show  multiple 
recurring  patterns,  for  example  a  weekly  trend 


[Figure: Flows/s, Fri Nov 30 16:35:00 2007 - Fri Dec 7 16:35:00 2007, Sat 01 to Fri 07; series: core-dst]
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Holt- Winters: 
Multiple  periods 


Daily  Period 


Weekly  period 


Noise 
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Learning  rate 


Fixed  learning  rate: 

The  first  pattern  is  overweighted 


Adaptive  learning  rate: 
The  weight  of  the  first  pattern 
is  relative  to  the  rest 


[Figure: two charts showing the weight (0% to 100%) of each pattern in the prediction per iteration, one for the fixed and one for the adaptive learning rate]
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Real  data  example 


Holt-Winters:
Usage Example

[Figure: two graphs of Bits/s, proto ICMP, labelled Mon Jul 23 19:45:00 2007 and Thu Jul 26 19:45:00 2007; series: Trillian, Arthur, Zaphod, Ford]

Normal ICMP Traffic

Aberrant ICMP Traffic:
Caused by DDos attack by Stormworm botnet
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Holt  Winters: 
Other  possible 


Common  SMTP  Traffic 


Last  week  SMTP  Traffic 
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Wim  Biemolt 
Wim.Biemolt@surfnet.nl 

www.surfnet.nl 


Werner  Schram 
Werner.Schram@surfnet.nl 

www.surfnet.nl 
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YAF 


Open-source,  IPFIX-compliant  bidirectional  flow  meter 

•  Available  from  http://tools.netsa.cert.org 

Processes  packets  from  multiple  inputs 

•  libpcap  dumpfiles  (ad-hoc  packet  analysis) 

•  libpcap  live  capture  (including  proprietary  pcap  interfaces,  e.g.  Bivio) 

•  Endace  DAG  live  capture 

Performance  is  network  hardware  and  I/O  bound... 

•  ...easily handles OC3, OC12, GigE at line speed, but

•  10GigE requires proprietary hardware at saturation.


Software  Engineering  institute 


Carnegie  Mellon 
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Flow  Meter  Design 



©2008  Carnegie  Mellon  University 
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Flow  Meter  Effects  on  Flow  Data 


Fragmentation 
End  Conditions 
Timeouts 
Delta  Counters 
Biflows 

The  Packet  Clock 


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Fragmentation 

Three  approaches  for  flowing  fragmented  traffic: 

•  pretend  there’s  no  such  thing  as  fragmentation, 

•  drop  all  fragmented  packets,  or 

•  full  or  partial  fragment  reassembly 

Each  approach  has  tradeoffs,  and  is  applicable  in 
certain  situations. 

YAF  supports  partial  reassembly. 


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Fragmentation? 

Easiest  way  to  handle  fragmentation:  don’t. 

Leads  to  inaccurate  flow  data  as  subsequent  fragment  port 
numbers  are  incorrectly  decoded: 


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Fragmentation?  (2) 


Often  used  in  resource-restricted  environments  (e.g., 
routers). 

•  Much  faster:  no  requirement  even  to  recognize 
fragmented  packets. 

•  Much  less  memory  consumption:  no  fragment  table. 

•  Less  susceptible  to  resource  exhaustion  attacks. 

Trivially  easy  to  implement. 

Difficult  or  impossible  to  recover  actual  flows  from 
random  fragment  offset  port  data. 


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Dropping  fragmented  packets 


Requires  minimal  resources  at  flow  meter: 

•  need  to  recognize  fragments,  but  not  store  them. 

Leads  to  meter  blindness: 

•  all  an  attacker  must  do  to  hide  from  the  measurement 
infrastructure  is  fragment  all  packets. 

Only  applicable  behind  perimeter  devices  which  also 
drop  all  fragmented  packets. 


sip   dip   proto   sp   dp   pkts
[no flows]
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Partial  fragment  reassembly 


Associate  each  fragmented  packet  with  its  actual 
transport  ports: 


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Partial  fragment  reassembly  (2) 

Accurately  assigns  fragments  to  respective  flows. 
Requires  additional  resources  at  flow  meter: 

•  need  to  recognize,  look  up,  and  store  every  fragment. 

More  difficult  to  implement  and  maintain. 

Requires  care  to  avoid  vulnerability  to  resource 
exhaustion  attacks. 
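
A minimal sketch of the port lookup behind partial reassembly, under stated assumptions: the class name, method signature and eviction policy are hypothetical and not taken from YAF. The first fragment (offset 0) carries the transport header, so its ports are remembered under the (source, destination, protocol, IP ID) key; later fragments look them up. The table is bounded so that a flood of unfinished fragment trains cannot exhaust memory.

```python
from collections import OrderedDict


class FragmentTable:
    """Remembers transport ports of first fragments, with bounded size."""

    def __init__(self, max_entries=4096):
        self.max_entries = max_entries
        self._ports = OrderedDict()

    def ports_for(self, src, dst, proto, ip_id, offset, more_fragments,
                  ports=None):
        """Return (sport, dport) for a packet, or None if unknown.

        `ports` must be supplied for packets with offset 0, since only
        those contain the transport header.
        """
        key = (src, dst, proto, ip_id)
        if offset == 0:
            if more_fragments:                  # first of several fragments
                self._ports[key] = ports
                self._ports.move_to_end(key)
                while len(self._ports) > self.max_entries:
                    self._ports.popitem(last=False)   # evict the oldest
            return ports
        found = self._ports.get(key)
        if not more_fragments:                  # last fragment: forget entry
            self._ports.pop(key, None)
        return found
```

A fragment arriving before its first fragment returns `None` in this sketch; a real meter would have to decide whether to buffer or drop it.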


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Flow  End  Conditions 

Flow  meter  must  recognize  actual  connection 
shutdown... 

•  ...through varying degrees of modeling the host TCP
state  machine. 

Flows  on  the  wire  are  not  always  so  well-behaved. 
Example:  multiple-RST  teardown. 


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Multiple  RST  teardown 

How  many  flows  here? 


sip       dip       flags   sp   dp   pkts
Y.Y.Y.Y   X.X.X.X   SAF     x    y    6
Y.Y.Y.Y   X.X.X.X   SAF     y    x    3
Y.Y.Y.Y   X.X.X.X   R       y    x    1
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Multiple  RST  teardown  (2) 


Tempting  to  group  RSTs  on  teardown  into  original 
flow. . . 

•  ...how long to keep closed flow state?

•  ...how far to take this RST grouping?

•  ...how to communicate new configuration parameters to
analysts?

YAF  stays  predictable,  at  the  expense  of  generating 
multiple  flow  records  for  this  behavior. 


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Passive  Timeouts 


Flows which have no packets over TO_passive seconds
are closed.

Necessary to terminate flows for all non-connection-oriented
transports,

•  i.e., anything but TCP.

Longer  passive  timeouts  consolidate  low-frequency 
periodic  activity  into  fewer  flows. 

Shorter  passive  timeouts  reduce  flow  table  resource 
consumption  for  such  activity. 


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Passive  timeouts  (2) 


Generally  chosen  to  match  common  protocol  timeouts... 

•  ...  which  are  generally  round  numbers,  e.g.,  10,  30,  60  sec. 

May  be  chosen  to  avoid  flow  closure  ambiguity  due  to  minor 
variations: 

•  e.g.,  12,  33,  64  sec. 


[Figure: packet timeline for packets spaced roughly 10 s apart (9 s to 11 s), comparing the flow records produced with a 10 s passive timeout (flows A1, A2) and with a 12 s passive timeout (flow A')]
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Active  Timeouts 


Flows which have been open for TO_active seconds are
closed.

•  Maximum flow duration is TO_active seconds.

Necessary  to  ensure  long-lived  flows  are  eventually 
flushed  from  the  flow  table. 

Active  timeout  determines  reporting  delay. 
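
Both timeouts can be sketched in a small flow table. This is an illustration only, not YAF's implementation: the class and method names are hypothetical, expiry is driven by the timestamp of each arriving packet, and the comparison operators (strictly greater for passive, greater-or-equal for active) are assumptions. Scanning every open flow on each packet is done for brevity; a real meter would use an ordered structure.

```python
class FlowTable:
    """Tracks open flows and closes them on passive or active timeout."""

    def __init__(self, passive_timeout, active_timeout):
        self.passive_timeout = passive_timeout
        self.active_timeout = active_timeout
        self.open = {}      # key -> [start time, last packet time, packets]
        self.closed = []    # (key, start, last, packets, reason)

    def _close(self, key, reason):
        start, last, packets = self.open.pop(key)
        self.closed.append((key, start, last, packets, reason))

    def _expire(self, now):
        for key in list(self.open):
            start, last, _ = self.open[key]
            if now - last > self.passive_timeout:
                self._close(key, "passive")     # idle for too long
            elif now - start >= self.active_timeout:
                self._close(key, "active")      # open for too long

    def add_packet(self, key, timestamp):
        """Account for one packet; `timestamp` comes from the packet itself."""
        self._expire(timestamp)
        record = self.open.get(key)
        if record is None:
            self.open[key] = [timestamp, timestamp, 1]
        else:
            record[1] = timestamp
            record[2] += 1

    def flush(self):
        """Close everything, e.g. at end of input."""
        for key in list(self.open):
            self._close(key, "flush")
```

With packets arriving every 10 s and one 11 s gap, a 10 s passive timeout yields two flow records while a 12 s timeout yields one, which is the ambiguity the slightly off-round timeout values are meant to avoid.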


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Active  Timeouts  (2) 

Shorter  active  timeouts  used  for  more  rapid  reporting. 
Longer  active  timeouts  used  for  better  data  reduction. 


[Figure: packet timeline for packets spaced 10 s to 15 s apart, comparing the flow records produced with a 30 s active timeout (flows A1, A2, ...) and with a 90 s active timeout (flows A'1, ...)]
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Delta  Counters 


Flow  meters  which  periodically  emit  multiple  flow  records  per 
flow  (for  rapid  reporting)  may  use  total  or  delta  counters. 

Total  counters  replace  values  in  previous  flow  records. 

Delta  counters  add  to  values  in  previous  flow  records... 

•  ...thereby  reducing  state  requirements  on  meter  and  increasing 
them  on  collector. 

YAF uses total counters, but doesn't emit multiple records per
flow...

•  ...uses active timeout instead.
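
The difference shows up on the collector side. In the sketch below (a hypothetical helper, not part of any IPFIX library), a record carrying total counters overwrites what the collector holds for that flow, while a record carrying delta counters is added to it.

```python
def apply_record(state, key, packets, octets, mode):
    """Update collector `state` (dict: flow key -> (packets, octets)).

    mode="total": the record replaces the stored values.
    mode="delta": the record is added to the stored values, so the
                  collector has to keep per-flow state between records.
    """
    if mode == "total":
        state[key] = (packets, octets)
    elif mode == "delta":
        old_packets, old_octets = state.get(key, (0, 0))
        state[key] = (old_packets + packets, old_octets + octets)
    else:
        raise ValueError("mode must be 'total' or 'delta'")
    return state[key]
```

A flow that has sent 5 and then 3 more packets is reported as 5 then 8 with total counters and as 5 then 3 with delta counters; both end at 8 on the collector.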
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Biflows 


Representation of two sides of a connection with a single flow
record:

•  Allows  additional  data  reduction 

•  Enables  easier  connection  analysis 

•  Improves  flow  state  modeling  at  flow  meter 

YAF  is  a  biflow  meter,  but  SiLK  stores  uniflows. 


Uniflows:   src (X)   dst (Y)   counters/values
            src (Y)   dst (X)   counters/values

Biflow:     src (X)   dst (Y)   fwd counters/values   rev counters/values
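
A minimal sketch of pairing uniflows into biflows, assuming records are dicts with hypothetical keys (`src`, `sport`, `dst`, `dport`, `proto`, `start`, `pkts`, `bytes`). Treating the direction seen first as the forward direction is an assumption of this sketch.

```python
def to_biflows(uniflows):
    """Merge matching forward and reverse uniflows into biflow records."""
    biflows = {}
    for f in sorted(uniflows, key=lambda f: f["start"]):
        fwd_key = (f["src"], f["sport"], f["dst"], f["dport"], f["proto"])
        rev_key = (f["dst"], f["dport"], f["src"], f["sport"], f["proto"])
        if rev_key in biflows:
            # The opposite direction was seen first: this is the reverse side.
            biflows[rev_key]["rev_pkts"] += f["pkts"]
            biflows[rev_key]["rev_bytes"] += f["bytes"]
        else:
            b = biflows.setdefault(fwd_key, {
                "src": f["src"], "sport": f["sport"],
                "dst": f["dst"], "dport": f["dport"], "proto": f["proto"],
                "fwd_pkts": 0, "fwd_bytes": 0, "rev_pkts": 0, "rev_bytes": 0,
            })
            b["fwd_pkts"] += f["pkts"]
            b["fwd_bytes"] += f["bytes"]
    return list(biflows.values())
```

Two records become one, and a zero reverse counter immediately identifies an unanswered connection attempt.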
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The  Packet  Clock 


Important  to  drive  all  processes  within  a  flow  meter 
with  a  single  clock 

•  fragment  timeouts,  flow  timeouts,  time  stamping,  etc. 

When  building  a  flow  meter,  gettimeofday(2)  is  not 
your  friend. 

•  often  a  problem  with  porting  host-based  software  into  a 
network-based  monitoring  environment 

Use  the  timestamp  from  the  packet  instead! 

•  ensures that the resulting flow stream is identical whether
captured live or generated from a dumpfile.
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Getting  YAF 

http://tools.netsa.cert.org 

Builds  on  Mac  OS  X,  Linux,  BSD,  Solaris 

•  Bug  reports  from  these  or  other  Unices  welcome! 

Some  prerequisites 

•  glib-2.0  (C  modernization  layer) 

•  libairframe  (application  utility  library  from  NetSA) 

•  libfixbuf  (IPFIX  protocol  implementation  from  NetSA) 

•  libpcap  (generally  available  on  most  modern  Unices) 

•  libdag  (only  required  for  Endace  DAG  capture) 
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Questions? 

Ask  now... 

...or  later: 

•  Brian  Trammell  <bht@cert.org> 

•  Chris  Inacio  <inacio@cert.org> 
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Abstract — Sharing  of  log  data  is  a  valuable  step  towards  the 
improvement  of  network  security.  However,  logs  often  contain 
sensitive  information  and  organizations  are  hesitant  to  share 
them. Anonymization methods are used for increasing protection,
lowering the disclosure risk to a level considered safe.
Accordingly,  a  metric  for  anonymity  is  necessary  to  quantitatively 
assess  the  risk  before  releasing  log  data.  In  this  paper,  we 
propose  a  general  framework  for  estimating  disclosure  risk  using 
conditional  entropy  between  the  original  and  the  anonymized 
datasets.  We  demonstrate  our  approach  using  network  log  files. 

I.  Introduction 

Log  data  analysis  is  a  powerful  tool  for  improving  network 
security.  Typically,  each  organization  uses  only  their  own  logs, 
but,  with  the  increasing  number  of  coordinated  attacks,  sharing 
log information between organizations is becoming essential [1].
The problem is that organizations are often reluctant to
share their logs, since the information contained in them can
be sensitive. Anonymization methods are used for limiting
disclosure risk in releasing such sensitive datasets. Anonymizing
data increases protection, lowering the disclosure risk, but it
also decreases the quality of the data and hence its utility [2].
Finding the optimal trade-off between risk and utility is the
main goal of the anonymization process. Both quantities are
hard to define, and strongly depend on context variables, e.g.,
data  usage,  level  of  knowledge  of  the  attacker,  amount  of 
data  released.  In  this  paper  we  focus  on  the  evaluation  of 
disclosure  risk  (Section  II).  The  main  contribution  of  this 
paper  is  to  introduce  a  general  measure  of  disclosure  risk, 
which  is  applicable  to  any  set  of  masking  transformations. 
Unlike  previous  measures,  we  do  not  assume  any  specific 
masking  algorithm.  Moreover,  our  measure  provides  a  robust 
estimation  of  risk  both  at  single  record  level  (local  risk)  and 
at  global  level,  i.e.  for  the  whole  dataset  (Section  III).  Our 
model  is  therefore  general  and  can  be  applied  for  quantitatively 
comparing  different  anonymization  policies.  Furthermore,  it  is 
directly  related  to  the  measure  of  the  information  lost  in  the 
anonymization  procedure.  We  implemented  this  risk  estimator 
using  the  FLAIM  framework  [3]  and  tested  on  network  log 
files  (Section  IV). 


II.  Privacy  in  public  datasets 

Data  holders,  such  as  national  statistical  institutes,  often 
have  to  release  data  files  containing  information  on  individual 
people or firms (micro-data) for research purposes. At the same
time they have to preserve the privacy of individuals. This
problem also occurs for sharing log files, since they may
contain personal information which cannot be released in its
original form (IP addresses, port numbers, timestamps,
quantities...). Consequently, these data holders need to anonymize
their  databases  before  release,  using  data  masking  algorithms 
such  as:  generalizing  the  data,  i.e.,  recoding  variables  into 
broader  classes  (e.g.,  releasing  only  the  first  two  digits  of 
the  zip  code  or  removing  the  last  octet  of  an  IP  address), 
suppressing  part  of  or  entire  records  (also  known  as  black 
marker  [3]),  randomly  swapping  some  fields  among  original 
data  records,  applying  permutations  (one-to-one  mapping  on  a 
defined  set)  or  perturbative  masking,  i.e.,  adding  random  noise 
to  numerical  data  values. 

When  masking  methods  have  been  applied,  data  holders 
have  to  quantitatively  assess  the  disclosure  risk  (or  anonymity 
level),  to  verify  whether  it  is  below  a  defined  threshold,  in 
which case it is assumed to be acceptable. To this end,
various measures for estimating disclosure risk have been
proposed so far [4], [5], [6]; their validity strongly depends on the
application scenarios considered, but still, there is a consensus
that the risk of disclosure cannot be reduced to zero (except by
removing all the information). Thus, in general, a threshold
should  be  determined  to  decide  whether  to  release  a  dataset  or 
not.  Broadly  speaking,  there  are  two  different  approaches  for 
assessing  disclosure  risk:  estimating  the  rareness  in  the  sample 
or  population,  or  estimating  the  probability  of  re-identifying  a 
masked  record  using  some  external  information. 

Let us examine these two methods in detail. In a typical
scenario an attacker has knowledge about some variables, which
may  identify  a  record  in  the  dataset.  Considering  the  example 
of  a  medical  database,  the  attacker  may  know  a  few  attributes 
(age,  gender,  marital  status)  from  an  external  public  register 
(census  data)  or  some  private  source  of  information  (e.g., 
knowing  age  and  address  of  his  neighbor).  He  then  tries  to 


(a) Original log file $\mathcal{S}$

SrcIP            SrcPort  DestIP           DestPort  Packets
168.125.253.23   80       147.81.124.173   3157      40
39.109.219.43    7310     142.68.227.108   59959     126
35.187.130.82    161      213.48.191.68    55867     83

(b) Anonymized log file $\mathcal{R}$

SrcIP            SrcPort  DestIP           DestPort  Packets
168.125.253.0    1023     10.1.1.1         65535     42
39.109.219.0     65535    10.1.1.1         65535     132
35.187.130.0     1023     10.1.1.1         65535     81

(c) Background knowledge $\hat{\mathcal{S}}$

SrcIP            SrcPort  DestIP           DestPort  Packets
39.109.219.43    7310     142.68.0.0       -         -

TABLE I: Example of original ($\mathcal{S}$) and anonymized ($\mathcal{R}$) log files. In
the anonymization process, the least-significant 8 bits of the SrcIP
are blacked out, BM(8) (replaced with 0s). SrcPort and DestPort
are partitioned into two classes (1023 and 65535), called binary
classification (C). DestIP is completely blacked out, BM(32). Packets
are perturbed with random Gaussian noise.

match  these  variables  (keys)  with  the  partly  altered  records  in 
the  released  database.  In  the  case  of  log  files,  an  attacker  may 
inject  some  information  (e.g.,  scanning  some  specific  ports), 
with  the  goal  of  later  recognizing  them  in  the  anonymized  logs. 
When  a  unique  record  matches  a  combination  of  key  variables, 
the  intruder  can  re-identify  the  masked  record,  assuming  he 
is  certain  that  the  record  is  in  the  dataset.  In  fact,  even  if 
there  is  more  than  a  unique  match,  but  the  number  of  linked 
records  characterized  by  that  combination  of  keys  is  still  low 
(say it does not exceed a threshold k), these records have
a high risk of re-identification. This rule is known as
k-anonymity [7]. This approach has some limitations: it does
not consider the intruder's knowledge explicitly, and, in the case of
continuous variables, the number of population uniques could
be  extremely  large,  especially  when  these  data  are  randomly 
perturbed  during  the  masking  process. 

The  second  approach  consists  of  estimating  the  probability 
of  re-identification.  As  in  the  previous  case,  the  attacker  aims 
at linking pairs of records in the released database with his
background information [8], [4], [9]. This method makes it possible
to assess the risk in both categorical and continuous data:
a  record  is  considered  at  risk  if  this  probability  exceeds  a 
fixed  threshold.  The  main  issue  with  this  approach  is  finding 
a  reliable  strategy  to  compute  these  probabilities,  since  in 
case  there  are  many  records  with  similar,  close  to  threshold, 
probabilities  of  re-identification,  the  risk  estimation  can  be 
strongly  affected  by  random  fluctuations. 

III.  Entropy  based  risk  estimator 

The  protection  model  we  propose  here  creates  a  measure 
of  disclosure  risk  for  micro-data  release,  which  combines 
together  the  two  approaches  described  above.  This  allows  us 
to  develop  a  measure  applicable  in  general  cases  (i.e.  for  any 
kind  of  data  transformation,  as  when  using  the  probability  of 
re-identification  method)  and,  at  the  same  time,  it  considers  the 
whole distribution of original records (as in k-anonymization).
The  basic  idea  is  to  use  Shannon  entropy  as  a  measure 


of  disclosure  risk  for  a  single  record.  Entropy  metrics  have 
previously  been  proposed  for  computing  information  loss  [10], 
and,  more  recently  for  estimating  disclosure  risk  for  tabular 
data  [11]  and  in  network  communications  [12],  [13]. 

In  this  section,  we  briefly  review  the  theoretical  framework 
and  analyze  its  mathematical  features.  We  refer  the  reader 
to  [14]  for  a  more  extended  discussion  on  the  topic. 

Let us consider a dataset $\mathcal{S}$ containing some sensitive data,
e.g., network log files (Table I(a)). Each entry $s \in \mathcal{S}$ of this
dataset is transformed using a data masking procedure, for
example one or more of the ones mentioned in the previous
section. The final result is an anonymized version of the dataset $\mathcal{S}$,
which we call $\mathcal{R}$ (Table I(b)).

The attacker aims at re-identifying released data by linking
them with some external information or background
knowledge $\hat{\mathcal{S}}$ (Table I(c)), which has some overlapping attributes
with the released dataset. If the attacker is able to reconstruct
some attribute values of the original record, we have a privacy
breach. Because the data holder does not know in advance
which records and attributes might be available to the attacker,
it must run the risk analysis on the whole released dataset,
$\hat{\mathcal{S}} = \mathcal{S}$, and assume a set of key attributes (called
quasi-identifiers in the k-anonymity framework) the attacker might
know and use for re-identification. These key attributes can
coincide with the whole set of attributes. The re-identification
procedure consists of estimating for each $\hat{s} \in \hat{\mathcal{S}}$ the probability
of linking it with a record $r \in \mathcal{R}$: $P(r|\hat{s})$. Because we are
assuming $\hat{\mathcal{S}} = \mathcal{S}$, hereafter we will consider $P(r|s)$
instead of $P(r|\hat{s})$.

We  can  estimate  this  probability  assuming  the  attacker 
simulates  the  data  masking  transformations  [15],  uses  the 
information  released  by  data  holders  (such  as  the  structure 
of  the  noise  added)  or  defines  a  distance  function  between 
records  [16].  Intuitively,  the  more  uncertain  the  mapping 
P(r|s),  the  lower  the  disclosure  risk.  Shannon’s  entropy 
can  be  used  to  estimate  this  uncertainty.  By  applying  it  to 
the conditional probability $P(r|s)$, the conditional entropy is
obtained:

$$H(\mathcal{R}|s) = -\sum_{r \in \mathcal{R}} P(r|s) \log_2 P(r|s) \qquad (1)$$

This quantity measures the risk at the level of a single record
$s$. It represents the average number of binary questions we have
to ask to identify the corresponding $r$ given $s$. Low entropy
values indicate an almost deterministic mapping, and accordingly
high risk, whereas large entropy is associated with low
disclosure risk.
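
As a small numerical illustration of the local risk measure, the sketch below evaluates the Shannon entropy of a given distribution P(r|s) in bits. The function name `conditional_entropy` and the example distributions are hypothetical; how P(r|s) is obtained for a real dataset is a separate question and is not shown here.

```python
import math

def conditional_entropy(probs):
    """Shannon entropy in bits of a distribution P(r|s) over candidates r.

    probs: iterable of probabilities that sum to 1; zero entries are
    skipped, following the convention 0 * log2(0) = 0.
    """
    probs = list(probs)
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("P(r|s) must sum to 1")
    return sum(-p * math.log2(p) for p in probs if p > 0)

sharp = conditional_entropy([1.0])          # deterministic mapping
uniform4 = conditional_entropy([0.25] * 4)  # four equally likely candidates
```

Here `sharp` is 0 bits (the record is linked with certainty, so the risk is highest) and `uniform4` is 2 bits (two yes/no questions are needed to single out one of four candidates).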

For example, in the case where a selected record $s$ can be
linked to exactly $k_s$ indistinguishable records in $\mathcal{R}$ (as in
k-anonymity [7]), we have a uniform distribution over the $k_s$
records, $P(r|s) = 1/k_s$, and the corresponding specific entropy, Eq. (1), is:

$$H(\mathcal{R}|s) = \log_2 k_s \qquad (2)$$

The k-anonymity condition over the whole dataset can be
written as:

$$k \le \min_{s \in \mathcal{S}} 2^{H(\mathcal{R}|s)} \qquad (3)$$


Global identification risk, that is, risk at the dataset level, can be
derived from the local risk measures, Eq. (1). One possible
choice (see [14] for other options) is to calculate the expected
number of correct matches ($E_{cm}$, herein):

$$E_{cm} = \sum_{s \in \mathcal{S}} 2^{-H(\mathcal{R}|s)} \qquad (4)$$

$E_{cm}$ is the average number of correct matches considering
the intruder is randomly guessing according to $P(r|s)$. In fact,
the entropy $H(\mathcal{R}|s)$ represents the average number of binary
questions required to determine $r$, given $s$ [14].
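
The global measure can be illustrated with a short sketch. It assumes that the expected number of correct matches sums 2^(-H(R|s)) over all records s, so that a record with k_s indistinguishable candidates contributes 1/k_s; the function name and the three example distributions are made up for illustration.

```python
import math

def expected_correct_matches(distributions):
    """Sum 2**(-H) over records, where H is the entropy of each P(r|s)."""
    total = 0.0
    for probs in distributions:
        h = sum(-p * math.log2(p) for p in probs if p > 0)
        total += 2.0 ** (-h)
    return total

# Three records: linked with certainty, to 2 candidates, to 4 candidates.
dists = [[1.0], [0.5, 0.5], [0.25] * 4]
ecm = expected_correct_matches(dists)
risk_percent = 100.0 * ecm / len(dists)
```

With these inputs the three records contribute 1, 1/2 and 1/4, so `ecm` is 1.75 and the risk expressed as a percentage of the number of records is about 58%.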

$E_{cm}$ differs from the estimated number of correct matches,
called Ntm herein, typically used for global risk assessment
(see [15], [9]). These two measures differ because Ntm is
based on maximum likelihood, which implies verifying a
posteriori whether a match is correct, whereas $E_{cm}$ is the
average number of correct matches considering a random
guess according to $P(r|s)$. So, the latter lacks the decoding
part (i.e., the maximum likelihood step) and relies on the shape
of the distribution only. In practice, they coincide when $P(r|s)$
has a single sharp peak, that is, an almost deterministic
one-to-one mapping. In contrast, they may strongly deviate in
the presence of multiple peaks and/or a smooth distribution. In
addition, because $E_{cm}$ depends on the shape of the whole
distribution (not only on its peak value), it is less sensitive to
random fluctuations [14]. Lastly, note that conditional entropy
is directly linked to the mutual information between $\mathcal{S}$ and $\mathcal{R}$,
and it can be used as an estimate of the information lost in
the anonymization transformation [14].

IV.  Measuring  risk  on  anonymized  network  logs 

We  tested  the  entropy-based  risk  estimator  on  a  publicly 
available Netflow log file (see footnote 1). In this analysis, we only use
a  limited  set  of  fields  and  records.  In  addition  we  do  not 
consider  the  utility  of  the  masked  dataset.  Consequently, 
results  presented  here  should  be  viewed  as  a  proof  of  concept 
and  recommendations  for  selecting  a  specific  anonymization 
policy  are  not  provided. 

To  run  these  tests,  we  developed  a  risk-estimating  module 
based  on  FLAIM  (Framework  for  Log  Anonymization  and 
Information  Management)  [3].  FLAIM  is  a  modular  and 
scalable  framework  for  anonymizing  log  files  which  includes 
an anonymization engine with various anonymization
primitives (BlackMarker, Permutation, Enumeration, etc.). We
developed a component, RiskEngine (see Figure 1), capable of
estimating disclosure risk (Eq. (4)) by comparing the original
and anonymized log files. Like the other FLAIM components,
the risk estimator works on streamed data, allowing us to
process very large datasets.

A.  Results 

To illustrate the previously described method and its
implementation in FLAIM, seven different anonymization scenarios
are presented. As testing dataset we used the sample Netflow

1  Available  at  http://flaim.ncsa.uiuc.edu/downloads/flaim/sample.nfdump.log 


Fig. 1: The structure of the risk estimation component (RiskEngine).
It is implemented as a subclass of AnonyEngine. The BasicPreprocessor
and BasicPostprocessor classes were extended with interfaces to
the RiskEngine. AnonyAlg and each of its subclasses now implement
a method for estimating the probability $P(r|s)$ for each
anonymization primitive.


Scenario  SRC_IP  DST_IP  SRC_PRT  DST_PRT  BYTES
S1        None    None    None     None     None
S2        BM(16)  BM(16)  None     None     None
S3        BM(16)  BM(16)  C        C        None
S4        BM(16)  BM(16)  C        C        NA(10%)
S5        BM(24)  BM(24)  C        C        NA(10%)
S6        BRP     BRP     C        C        NA(10%)
S7        BM(32)  BM(32)  C        C        NA(10%)

TABLE II: List of the 7 anonymization scenarios discussed in the
main text, in order of increasing anonymization strength. Legend:
BM(16) (BM(24)): Black Marker applied on the 16 (24)
least-significant bits. C: Classify: bins ports below 1024 in one bin and
ports greater than or equal to 1024 in another. NA(10%): Noise Addition:
adds zero-mean Gaussian noise with a standard deviation equal to
10% of the value to anonymize. BRP: Binary Random Permutation:
maps each IP into a randomly generated IP in a consistent way (all
IPs equal in the original log file are also equal in the anonymized
log file). For more details about these transformations see Ref. [3].
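
The three simplest primitives in the legend can be sketched in a few lines. This is an illustration under stated assumptions, not FLAIM's code or API: the function names are hypothetical, the black marker is assumed to overwrite the masked bits with zeros, and the class labels 1023 and 65535 follow the example in Table I.

```python
import ipaddress
import random

def black_marker(ip, bits):
    """Zero the `bits` least-significant bits of an IPv4 address."""
    value = int(ipaddress.IPv4Address(ip))
    mask = (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value & mask))

def classify_port(port):
    """Bin a port: below 1024 -> 1023, otherwise -> 65535."""
    return 1023 if port < 1024 else 65535

def add_noise(value, rel_sigma=0.10, rng=random):
    """Add zero-mean Gaussian noise with std dev rel_sigma * value."""
    return value + rng.gauss(0.0, rel_sigma * value)
```

For instance, `black_marker("168.125.253.23", 8)` gives `168.125.253.0` and `classify_port(7310)` gives `65535`. Passing a seeded `random.Random` instance as `rng` makes the noise reproducible.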


file  available  on  the  FLAIM  website.  The  nfdump  module 
provided  in  FLAIM  is  used  for  parsing  the  log  file.  We 
considered  a  subset  of  the  available  fields:  the  source  and 
destination  IPs,  the  source  and  destination  ports  and  the 
number  of  bytes  in  a  flow.  The  seven  scenarios  are  summarized 
in  Table  II. 

Each anonymization primitive has its corresponding
function for calculating the probability $P(r|s)$. For the sake of
simplicity we assumed that the different fields are independent.
Therefore, in the example above, $P(r|s)$ reads:

$$P(r|s) = P(r|s)_{\mathrm{SRC\_IP}} \cdot P(r|s)_{\mathrm{DST\_IP}} \cdot P(r|s)_{\mathrm{SRC\_PRT}} \cdot P(r|s)_{\mathrm{DST\_PRT}} \cdot P(r|s)_{\mathrm{BYTES}}$$
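
Under the independence assumption, a linking distribution can be built by multiplying one term per field. The sketch below is one possible reading, not the RiskEngine implementation: the per-field functions, the normalization over the candidate set, and all names are assumptions made for illustration. The Gaussian term omits its constant factor, which cancels in the normalization because the standard deviation depends only on the original value.

```python
import math

def port_class(s_port, r_port):
    """1 if the released class label matches the class of the original port."""
    return 1.0 if (1023 if s_port < 1024 else 65535) == r_port else 0.0

def noisy_count(s_val, r_val, rel_sigma=0.10):
    """Unnormalized Gaussian likelihood; assumes s_val > 0."""
    sigma = rel_sigma * s_val
    return math.exp(-0.5 * ((r_val - s_val) / sigma) ** 2)

def link_distribution(s, candidates, field_likelihoods):
    """P(r|s) over candidates, as a normalized product of per-field terms."""
    weights = []
    for r in candidates:
        w = 1.0
        for field, likelihood in field_likelihoods.items():
            w *= likelihood(s[field], r[field])
        weights.append(w)
    total = sum(weights)
    if total == 0:
        raise ValueError("no candidate is compatible with s")
    return [w / total for w in weights]

s = {"port": 80, "packets": 40}
candidates = [
    {"port": 1023, "packets": 42},
    {"port": 65535, "packets": 40},
    {"port": 1023, "packets": 81},
]
probs = link_distribution(
    s, candidates, {"port": port_class, "packets": noisy_count}
)
```

In this example the second candidate is ruled out by its port class and the third by its packet count, so almost all of the probability mass falls on the first candidate.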


Fig. 2: Entropy-based risk (percentage of expected correct matches)
for the 7 anonymization scenarios described in Table II

Figure 2 shows the expected number of correct matches,
$E_{cm}$, as a percentage of the total number of records for the
seven scenarios. Intuitively, increasing the number and strength
of the anonymization methods leads to a reduced disclosure
risk. In more detail, we observed that removing the last 8
bits in the IP addresses has no impact on the estimated
risk (Scenarios 1 and 2). Similarly, suppressing the 16 or the
24 least-significant bits of the IP addresses leads to similar
risk values (Scenarios 4 and 5). This indicates that, in this
sample, most of the IP addresses sharing the same first octet
are actually the same address (however, the corresponding port
is not necessarily the same). In other words, due to a lack of
diversity, most of the IPs can be identified by their 8
most-significant bits. By generalizing the port number (Scenario 3),
we observed a ~36% decrease in the risk, suggesting that
port re-coding could be a valuable anonymization strategy in
this context. Adding random noise on the number of packets
transmitted gives a further ~8% decrease in the risk (Scenario
4). To obtain low risk values, we needed to remove most of the
information contained in IP addresses by either using a
one-to-one mapping into a predefined set (binary random permutation,
Scenario 6) or black marking all the 32 bits of the address
(BM(32) in Scenario 7).

V.  Summary 

The  advantage  of  using  Shannon’s  entropy  as  a  measure 
of  disclosure  risk  for  log  file  release  is  twofold:  First,  it 
can be applied to any general masking transformation, unlike
the k-anonymity measure, which is limited to non-perturbative
masking transformations. Second, it only depends on the shape
of  the  probability  distribution;  thus  it  is  less  sensitive  to 
random  fluctuations  than  measures  where  decoding  of  the 
masked  record  is  needed. 

The main technical issue is that computing the probability
of re-identification can be hard for complex masking
transformations [15]. Furthermore, these probabilities depend on
the attack scenarios (attacker's knowledge, data sensitivity,
etc.), which are often difficult to model and
application-specific. In the simple example we presented here, we could
easily derive these probabilities under the assumptions of
independence among fields and records. Both these hypotheses
are unrealistic in many real-world scenarios, such as in a port
scanning attack, where multiple ports are scanned in sequence
on a single target host. Further analysis is needed to investigate
the viability of this approach in realistic settings.
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Integration  of  Context  into 
Data  Analysis  and 
Visualization 


Ashley  Thomas,  Uday  Banerjee 

SecureWorks


Existing  approaches  to  Analysis 


Existing  workflow  in  a  typical  environment 

•  Mostly  analyze  data  from  separate  sources 
(IDS/IPS/Firewall/Syslog/etc.)  in  a  semi-integrated  textual  view  (SIM). 

o  Although the view may be integrated, typically the correlation is
left up to the analyst. This is typically a complex task, demanding
continuously high levels of cognition, and may lead to incomplete
analyses.

•  Analysts have to reference other tools (IDS signature details, packet
captures, historical information, etc.) to make the proper
determination.


Existing  approaches  to  Analysis 
(contd.) 


Most  commercial  environments  are  SLA  driven,  so  no  motivation 
to  use  ‘yet  another  tool’  (read  ’visualization’)  to  perform  analysis. 

o  Millions  of  alerts  per  day 
o  High  rate  of  false  positives  from  alerts  in  the  field 
o  Limited  number  of  analysts 
o  Time  spent  on  each  alert  is  very  limited 
■  quality  of  analysis  affected 


Existing  approaches  to  Analysis  -  Data 
Visualization 


A  well  studied  field: 

•  Several  tools  documented  here:  http://www.vizsec.org/applications 


Visualization has faced problems with getting adopted into a typical analyst's
workflow

•  Tool  is  not  purpose  built  for  the  environment 

•  Flexibility  (is  not  always  there  to  build  your  own  visualizations) 

•  Performance  (of  viz  tools  is  very  important.  A  slow  tool  is  going  to  be 
abandoned  sooner  or  later) 

•  Gives  the  analyst  a  ‘free  flow’  exploration  of  the  data,  but  depends  on 
him/her  for  finding  the  needle  in  the  haystack.  There  is  a  need  for  some 
additional  context  to  be  provided  to  the  analyst. 

•  Most systems just allow for exploration of data, but do not allow for
inferences to be translated into 'work done'. In a typical commercial
environment, SLAs dictate workflows, and the ROI on a given tool
(investment = time spent as part of analysis, return = inference that other
tools in the workflow did not give us) needs to be very high in order to
become a standard part of the workflow.


Cross  Platform  Data  Analysis 


The  ultimate  goal  is  to  have  a  unified  data  set  that  can  be 
analyzed  across  different  services,  devices,  applications,  etc. 

•  Normalize  data  from  different  sources  (IDS  alerts,  traffic  flows,  firewall 
and  application  logs,  etc.) 

•  Extract  context  where  applicable  and  present  to  the  Analyst 

•  Visualize  this  data  and  present  the  ‘Big  Picture’ 

•  Allow  the  Analyst  to  resolve  these  events  in  the  visualization  GUI  itself 


A  sample  alert:  analysis 


[**] [1:648:7] SHELLCODE x86 NOOP [**]

[Classification: Executable code was detected] [Priority: 1]

08/09-15:46:51.632771 192.168.1.121:54835 -> 192.168.1.136:80
TCP TTL:64 TOS:0x0 ID:3403 IpLen:20 DgmLen:457 DF
***AP*** Seq: 0x1E0C3C55  Ack: 0xB33C734D  Win: 0x5C  TcpLen: 32
TCP Options (3) => NOP NOP TS: 1221541188 17773996
[Xref => http://www.whitehats.com/info/IDS181]

•  An  example  alert: 
o  Server  vulnerable? 
o  False  positive? 
o  Attempt  successful? 


More  context;  Better  analysis 


Flow  record  right  after 

08/09-15:47:12  192.168.1.136  209.185.243.135  TCP  2255  21
2666  15MB  <snip>

FTP  download  by  the  server  from  an  unknown  site  -  suspicious. 


An  integrated  platform 


•  Correlating  alerts,  flows,  logs  into  the  same  platform 

o  More  context;  better  analysis 

•  Ability  to  visualize  data  flexibly  (Analyst  can  override  default 
visualizations  and  create  new  ones  -  e.g.  Bar  Chart  over  Pie 
Graph) 

•  Ability  to  drill-down/up  based  on  time,  ip  address,  other 
variables 

•  Provides  guidance  (via  predefined  rules) 

•  Integration  of  the  analysis  and  taking  action  (ability  for  the 
Analyst  to  resolve  events  via  the  visualization  interface) 


Architecture 


Architecture:  Processing  Engine 


•  Ability  to  integrate  and  correlate. 

•  Ability  to  zoom-in 

•  Plug-in  architecture: 

Each  record  type  that  is  supported  will  be  handled  by  an  appropriate 
plug-in. 

o  IDS  alert  plug-in 

■  isensor  IPS 

■  Snort  IDS 

■  cisco 

■  mcafee 

o  netflow  record  plug-in 
o  Firewall  plugin. 

Each plug-in is aware of the formatting of its record type.
When a record is finally stored in the DB, it is stored in a
consistent fashion.


Architecture:  Rule  Engine 


•  Provides  guidance 

o  Simple  predefined  rules  search  the  data  for  the  existence 
of  certain  conditions,  and  highlight  certain  records  or 
flows  in  order  to  provide  guidance  to  the  Analyst  if 
applicable.  Some  examples  below: 

■  TCP  Syn  packets  to  external  addresses  on  port  135/139/445 

■  Change  in  threshold  of  flow  activity  (>50%)  for  a  given  host  in  a 
time  window 

■  Outbound  activity  to  port  25  to  Yahoo,  AOL,  Hotmail,  etc.  mail 
servers 

■  Traffic  directed  to  bogon  IP  addresses 

•  Temporary  Store: 

o  Data  subset  for  a  certain  window  of  time,  e.g.  (now  -  2  hours  ago). 
This  may  be  the  data  the  analyst  will  work  on. 
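
Two of the example rules above can be sketched as simple predicates over flow records. Everything here is illustrative: the record fields (`proto`, `flags`, `dport`, `dst`), the internal address range, and the function names are assumptions, not part of the tool described on the slide.

```python
import ipaddress

INTERNAL = ipaddress.ip_network("192.168.0.0/16")  # assumed internal range
WATCHED_PORTS = {135, 139, 445}

def syn_to_external(flow):
    """TCP SYN to an external address on port 135, 139 or 445."""
    return (
        flow["proto"] == "TCP"
        and "S" in flow["flags"]
        and flow["dport"] in WATCHED_PORTS
        and ipaddress.ip_address(flow["dst"]) not in INTERNAL
    )

def activity_changed(previous_count, current_count, threshold=0.5):
    """Flow count for a host changed by more than `threshold` between windows."""
    if previous_count == 0:
        return current_count > 0
    return abs(current_count - previous_count) / previous_count > threshold

def highlight(flows, rules):
    """Return (flow, rule name) pairs for every rule a flow matches."""
    return [(f, r.__name__) for f in flows for r in rules if r(f)]
```

A rule engine of this shape only marks records for the analyst's attention; it does not decide whether an event is malicious.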


Architecture:  Visualization  interface 


•  Flexible,  fast  interface  that  allows  drill-down/up  capability  and 
the  ability  to  assign  a  determination  to  the  result  set 

•  Consists  of  a  'parameter'  section  that  allows  the  Analyst  to 
shape  the  data  set  to  be  visualized  (basically  creating  a  SQL 
query) 

•  Once  this  query  is  submitted,  the  resultant  data  set  is 
visualized  using  a  set  of  default  templates 


Architecture:  Visualization  interface 
(contd.) 


•  The  Analyst  has  the  flexibility  to  change  these  default 
visualizations  to  something  they  feel  could  be  more 
appropriate. 

•  R  (www.r-project.org)  was  our  first  choice  to  display  the 
graphics 

o  Areas of investigation: Interactive images (Image Maps)
that  allow  for  'click  and  drill  down',  better  suited  packages 
to  display  some  relationships  (Lattice  for  portscans,  etc.) 
o  Commercial  tools  exist  that  do  a  very  good  job  of 
visualizing  data  (but  external  development  can  be  an 
issue)  (e.g.  www.advizorsolutions.com, 
www.vizsec.org/applications/commercial-applications) 


Architecture:  Doing  work 


•  The tool also enables the analyst to take action from the
same  GUI  front  end. 

o  This  may  improve  efficiency  and  speed  of  analysis 
o  Allows  the  Analyst  to  resolve  events  in  a  larger  scale 

■  Mark  all  events  from  a  source  IP  as  benign  (e.g. 
known  scanner) 

■  Escalate  all  events  from  a  given  source  IP 
(established  to  be  a  known  bad  IP  after  analysis). 
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Background 


■  CIAC  provides  24x7  “on-call”  operational  cyber  security 
services  to  the  Department  of  Energy  (DOE) 

■  CIAC's Mission:

•  Prevent cyber incidents whenever possible

•  Perform predictive analysis to Watch and Warn for
any real or potential threats to DOE

•  Assist in the Response and restoration of operations
should an incident occur

■  CIAC  collaborates  with  local  site  security  personnel  and 
other  cyber  security  agencies 
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Motivation  for  Identifying  Network  Beaconing 


■  We  seek  additional  indicators  of  malware  infection  to 
support  proactive  incident  detection  as  well  as  to 
supplement  incident  response  and  forensics  efforts. 

■  Analysis  of  previously  identified  incidents  has 
uncovered  network  sessions  sharing  common 
characteristics  that  recur  at  regular  intervals.  We 
identify  this  as  “network  beaconing  activity.” 
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Network  Beaconing  Detection  Strategy 


Our  objective  is  to  detect  the  following  intrusion  scenario: 

■  Malware  delivered  via  phishing  email,  drive-by- 
download,  etc. 

■  Malware  attempts  connection  to  an  unknown  controller 

•  If  controller  is  not  available,  malware  sleeps  for  a 
fixed  duration  and  retries  connection 

We  use  this  retry  interval  as  an  indicator  of  possible 
malware  activity 
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Discovery  Methodology :  Overview 


■  Aggregate  flow  session  summaries  into  bi-directional 
records  and  order  by  start  time 

■  Check  each  session  against  whitelist  criteria 

■  Maintain  a  database  of  inter-session  times  for  each 
source  and  destination  IP;  update  for  each  new 
session 

■  Report  session  groups  that  match  a  threshold  of 
network  beaconing  activity 
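
The steps above can be sketched as a small batch function. This is not the CIAC prototype: the regularity test (coefficient of variation of the inter-session times), the thresholds `min_sessions` and `max_cv`, and the destination-only whitelist are assumptions chosen to keep the example short.

```python
from collections import defaultdict
from statistics import mean, pstdev

def find_beacons(sessions, whitelist=frozenset(), min_sessions=4, max_cv=0.1):
    """Flag (src, dst) pairs whose sessions start at near-regular intervals.

    sessions: iterable of (start_time, src, dst) tuples.
    A pair is reported when it has at least `min_sessions` sessions and
    the coefficient of variation of its inter-session times is <= max_cv.
    Returns a dict mapping each reported pair to its mean interval.
    """
    starts = defaultdict(list)
    for start, src, dst in sorted(sessions):  # order by start time
        if dst in whitelist:
            continue
        starts[(src, dst)].append(start)

    beacons = {}
    for pair, times in starts.items():
        if len(times) < min_sessions:
            continue
        gaps = [b - a for a, b in zip(times, times[1:])]
        avg = mean(gaps)
        if avg > 0 and pstdev(gaps) / avg <= max_cv:
            beacons[pair] = avg
    return beacons
```

A streaming version would update the per-pair state as each session arrives instead of collecting all start times first, which is closer to the database-backed approach described above.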
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Discovery  Methodology :  Logical  Flow 


•Report  and  prune  stale  records 
•Report  ongoing  records 
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Discovery  Methodology :  Aggregate  Session  Information 


Flow Record                     |  Database Record (61 Bytes)
--------------------------------+------------------------------------------
Source IP                       |  {Source, Destination} IP (Key)
Destination IP                  |  {Start, End} Timestamp
Protocol                        |  Session Count
Source Port                     |  First Seen Protocol
Destination Port                |  Is Multiple Protocols
Source Bytes                    |  First Seen {Source, Destination} Port
Destination Bytes               |  Is Multiple {Source, Destination} Ports
Source Packets                  |  {Source, Destination} Bytes Mean
Destination Packets             |  {Source, Destination} Bytes Std Dev
Source Flags                    |  {Source, Destination} Packet Count Total
Destination Flags               |  {Source, Destination} Flags (Logical OR)
Flags of 1st Packet in Session  |  Session Starting With SYN Count
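
A fixed-size summary of this kind can be maintained incrementally, without storing the individual sessions. The sketch below covers only a few of the fields and tracks one direction; the class name, the use of Welford's online update for the mean and standard deviation, and the choice of the population standard deviation are assumptions, not details from the slides.

```python
import math

class PairRecord:
    """Running summary of all sessions seen for one (source, destination) pair."""

    def __init__(self):
        self.session_count = 0
        self.bytes_mean = 0.0
        self._m2 = 0.0            # sum of squared deviations (Welford)
        self.flags_or = 0
        self.syn_start_count = 0

    def update(self, src_bytes, flags, first_flags):
        """Fold one session into the summary."""
        self.session_count += 1
        delta = src_bytes - self.bytes_mean
        self.bytes_mean += delta / self.session_count
        self._m2 += delta * (src_bytes - self.bytes_mean)
        self.flags_or |= flags
        if first_flags & 0x02:    # TCP SYN bit
            self.syn_start_count += 1

    @property
    def bytes_std_dev(self):
        if self.session_count < 2:
            return 0.0
        return math.sqrt(self._m2 / self.session_count)  # population std dev
```

Each update touches a constant amount of state, which is what keeps the per-pair record at a fixed size regardless of how many sessions have been observed.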
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Results :  Qualitative 


Beacons identified on one day in November 2007

57,258  Beacon  Records,  17,706  IPs,  21,224  Src-Dst  IP  Pairs 
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Results :  Quantitative


■  Prototype script using Perl + Berkeley DB on a 2.8GHz
Xeon processor processes ~4800 sessions per second

■  Midday on a work day in November 2007:

•  ~500,000 unique "active" internal IP addresses
monitored

•  2,351,565 unique src-dst pairs being tracked

•  ~1GB disk space for Berkeley DB database files
(~140M raw data size)

■  A week in November 2007:

•  732,959 beacon records generated

-  14,842 unique source IPs
-  74,753 unique destination IPs
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Analysis  Methodology :  Incident  Response 


■  If a compromised host is
identified,  past  beaconing 
behavior  of  host  may  provide 
a  toehold  into  the  start  of  the 
intrusion 

■  If  malicious  IP  is  identified 
(watchlist,  other  intrusion, 
etc),  beaconing  activity  to  that 
IP  may  warrant  additional 
concern. 


Graph  view  of  Netflow  (black),  intrusion 
detection  (red),  and  beaconing  (dotted)  records 
from  a  host. 
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Incident  Detection  :  False  Alarms 


■  Network  beaconing  activity  is  prevalent  in  many 
applications  and  protocols  (NTP,  RSS  Feeds, 
automated  software  patching,  etc) 

•  Can  be  somewhat  mitigated  by  whitelisting  “trusted” 
IP  addresses 

■  Keep-alive  traffic  in  long  lived  sessions  may  appear  as 
beacons 

•  For  TCP  traffic,  we  can  investigate  the  Flags  field 

■  Does  adware  on  a  host  constitute  a  false  alarm?  What 
about  spyware? 
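
The two mitigations above (whitelisting and inspecting TCP flags) could look roughly like this in Python. The function names, the example networks, and the 50% SYN threshold are illustrative assumptions, not values from the presentation.

```python
import ipaddress


def is_whitelisted(ip, trusted_networks):
    """True if ip falls inside any trusted network given in CIDR notation."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in trusted_networks)


def looks_like_keepalive(session_count, syn_start_count, min_syn_fraction=0.5):
    """Heuristic: few sessions start with SYN, so the 'beacon' is probably
    keep-alive traffic inside one long-lived TCP connection."""
    if session_count == 0:
        return False
    return syn_start_count / session_count < min_syn_fraction
```

The second check relies on counting how many sessions of a pair begin with a SYN packet, which is why that counter is useful to keep in the per-pair record.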


Lawrence  Livermore  National  Laboratory 


UCRL-PRES-236878 


DOE  Computer  Incident  Advisory  Capability  (CIAC) 



Analysis  Methodology :  Incident  Detection 


■  Rank  identified  beacons  by  how  ‘interesting’  they  are 

•  Attempt  to  determine  the  cause  of  the  beaconing 

-  Significantly  helped  by  domain  knowledge  of 
internal  hosts,  software  configuration,  security 
policy,  and  acceptable  use  policy 

■  In  our  experience  of  proactive  investigations,  fewer  than 
5%  of  beacons  investigated  were  determined  to  be 
malicious.  Several  potential  policy  violations  identified. 


[Chart: Link Attribute by Time Chart. Count of 'BiFlow' on 'Session' links over time (GMT), 2 Apr 2007 to 6 Apr 2007. Legend: 216.235.95.144, 200.41.3.16, 63.236.14.35, 212.112.188.73, 66.246.245.56]

Interesting  beaconing  to  5  hosts  worldwide.  Later  explained  by  a  popular  media  player  refreshing  ads. 
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Incident  Detection  :  What’s  Going  On? 


Two  Hosts  beaconing  to  262  hosts  (TCP  2170) 
over  several  hours  with  large  response  bytes. 

[globus] 


Several  hosts  beaconing  to  multiple  destinations 
on  TCP  and  UDP;  some  beacons  never  respond 

[peer  to  peer  download  manager] 


Three  Hosts  beaconing  to  a  host  (TCP  80) 
every  3  hours. 

[i**********  spyware  phoning  home] 


Seven  Hosts  beaconing  to  3  hosts  (TCP  30000) 
over  several  hours  with  no  response. 

[“canadapost”  shipping  module  ???] 


Conclusion 


■  Identification  and  analysis  of  network  beaconing  activity 
in  flow  data  was  readily  achievable  in  our  environment. 

■  Network  beaconing  logs  have  provided  us  with 
additional  indicators  that  support  incident  detection  and 
forensics. 

■  A  high  false  positive  rate  hinders  conclusive  findings  in 
the  absence  of  additional  evidence. 

■  When  combined  with  other  available  security  indicators, 
network  beaconing  activity  has  led  to  the  discovery  of 
network  misconfigurations,  policy  violations,  and 
compromises. 
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Useful  resources 


■  Usual  Internet  Metadata  (Whois,  Search  Engines,  etc) 

■  Passive  DNS  Repositories 

■  Detailed  host  usage  information  (server,  desktop, 
honeypot,  etc) 

■  A  really  quick  way  to  slice  and  dice  lots  of  data 
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Abstract — Netflow  is  an  efficient  and  flexible  mechanism  to 
collect network data and share it with security applications that
require distributed knowledge. As information sharing breeds the
danger of revealing user and network information, anonymization
of NetFlow data has to be applied before it is shared.
To accommodate anonymization needs we have developed Anontool.
Anontool allows per-field anonymization up to the NetFlow layer
offering  a  wide  range  of  primitives  to  choose  from.  However, 
although  we  have  the  tools  to  perform  anonymization,  work 
needs  to  be  done  on  the  policy  part.  Some  policies  may  be 
proven  weak,  if  a  third  party  can  deanonymize  the  data  and 
reveal  user  information.  We  demonstrate  2  possible  attacks  on 
anonymized  traces.  The  first  one  is  called  active  fingerprinting, 
where  a  malicious  user  injects  packets  that  she  can  identify 
in  an  anonymized  trace  and  thus  reveal  the  mappings  of  IP 
addresses.  The  second  one  is  the  disclosure  of  web  pages  that 
users  access  based  on  the  flow  sizes  recorded  in  a  trace.  We  also 
present  solutions  to  these  attacks  along  with  two  anonymization 
primitives  which  we  have  implemented  in  anontool  in  order  to 
defend  against  such  attacks,  to  the  extent  possible. 

I.  Introduction

As  computer  networks  evolve  and  grow  in  size,  the  need 
for  distributed  network  management  and  monitoring  becomes 
more  important  than  ever.  Network  activity  log  sharing  has 
gained significant popularity recently, not only among computer
security engineers and administrators, but also among
researchers,  developers  and  educators.  To  accommodate  this 
increasingly  popular  need  for  information  sharing  as  well  as 
the  fundamental  lack  of  trust  between  members  of  different 
communities,  several  tools  have  emerged  which  enable  their 
users  to  anonymize  potentially  sensitive  information  within 
those  logs. 

A  popular  format  used  by  network  activity  logs  is  the  Cisco 
NetFlow  [1]  format.  The  NetFlow  format  is  based  on  the 
concept  of  a  flow,  which  Cisco  has  defined  as  a  set  of  packets 
that  have  the  following  five  properties  in  common:  source  and 
destination  IP  address,  source  and  destination  port  numbers,  as 
well  as  the  IP  protocol  field  value.  The  most  recent  evolution 
of the NetFlow format is version 9, which is currently the
basis  of  the  IETF  [2]  standard  for  information  export.  Given 
this  fact,  NetFlow  is  likely  to  gain  even  more  in  popularity. 
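
As a small illustration of the five-property flow definition, the Python sketch below groups packets by that five-tuple. It is not NetFlow export code; `FlowKey`, `aggregate`, and the packet tuple layout are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The five properties that packets of one flow have in common."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def aggregate(packets):
    """Group packets into flows.

    packets: iterable of (src_ip, dst_ip, src_port, dst_port, protocol, size).
    Returns {FlowKey: [packet_count, byte_count]}.
    """
    flows = defaultdict(lambda: [0, 0])
    for src_ip, dst_ip, src_port, dst_port, protocol, size in packets:
        stats = flows[FlowKey(src_ip, dst_ip, src_port, dst_port, protocol)]
        stats[0] += 1
        stats[1] += size
    return dict(flows)
```

Two packets that differ in any one of the five fields, for example the source port, end up in different flows.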

Many  diverse  tools  and  techniques  have  been  implemented 
for anonymization purposes. Most of them, however, are customized
for specific purposes or provide limited functionality.


Our approach, Anontool, is a general-purpose tool that can
anonymize live or stored traffic. Using Anontool, a user can
choose  from  a  large  variety  of  functions  to  use  on  each  and 
every  application  field,  to  implement  her  anonymization  policy 
of  choice.  Anontool  supports  a  number  of  protocols,  but  in  this 
work  we  will  focus  on  the  NetFlow  protocol. 

Regardless of which tools one uses, the final log produced by
the anonymization process is likely to be publicly available.
There  is  always  the  potential  for  an  adversary  to  be  able 
to  infer  a  large  amount  of  information  from  the  log,  if  the 
anonymization  policy  is  not  chosen  carefully.  We  will  describe 
a  few  scenarios  where  an  adversary  can  manipulate  the  log 
anonymization  process  in  order  to  deduce  useful  information 
about  the  original  trace.  Given  those  attack  scenarios,  the 
need  for  techniques  to  defend  against  them  becomes  clear. 
We  therefore  describe  two  anonymization  primitives  which 
can  aid  in  protecting  against  such  attacks.  A  proof-of-concept 
implementation  of  them  has  already  been  incorporated  into  the 
latest version of Anontool.

II.  Anontool  Description 

Anontool  is  a  command  line  tool  which  enables  users  to 
anonymize  both  live  and  stored  traffic.  Its  functionality  is 
based  upon  the  Anonymization  API  [3],  in  short  AAPI.  AAPI 
allows  users  to  write  their  own  anonymization  applications. 
They can define which anonymization function is applied to
each field, having complete freedom in deciding their policy.
It  provides  a  large  set  of  anonymization  primitives,  from 
setting  fields  to  constant  values  and  performing  basic  mapping 
functions  to  prefix-preserving  anonymization  and  several  hash 
functions  and  block  ciphers,  as  well  as  support  for  regular 
expression  matching  and  replacement.  AAPI  can  operate  on 
a  wide  variety  of  protocols,  ranging  from  Ethernet  to  HTTP 
and  FTP  in  the  application  layer.  All  protocol  fields  are  being 
made  available  to  the  user  application. 
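
The idea of a per-field policy can be pictured with the following Python sketch. It does not use or reproduce the AAPI interface, which is a C library; the names `set_constant`, `sha256_hex`, `apply_policy`, and the record field names are all hypothetical.

```python
import hashlib


def set_constant(value):
    """Primitive: replace the field with a fixed value."""
    return lambda _field: value


def sha256_hex(field):
    """Primitive: replace the field with the hex SHA-256 digest of its text form."""
    return hashlib.sha256(str(field).encode()).hexdigest()


def apply_policy(record, policy):
    """Apply one function per field; fields without a rule pass through unchanged."""
    return {name: policy.get(name, lambda v: v)(value)
            for name, value in record.items()}


# Example policy: hash the source address, zero the ToS byte, keep the rest.
example_policy = {"src_ip": sha256_hex, "tos": set_constant(0)}
```

The point of the design is that the policy is data: changing which primitive applies to which field does not require changing the anonymization code.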

AAPI  has  been  implemented  as  a  user-level  library  in  the 
C  language;  it  provides  function  calls  for  creating  packet 
“streams”, filtering using BPF filters, and of course applying
anonymization functions. One of its main design goals was to
accommodate extensibility: developers can write their own
protocol decoders (for SMTP, for example), similar to the ones
already available for HTTP or NetFlow, or their own
anonymization primitives. It is also straightforward to write
code that supports new input sources with few code additions.

Since the NetFlow format for packet export is continuously
gaining in popularity [4], [5], [6], we decided to extend
AAPI, and subsequently Anontool, with support for the Cisco
NetFlow packet export format. We took advantage of AAPI’s
extensibility  and  implemented  decoding  and  anonymization 
functions  for  both  version  5  and  the  newly  defined  version 
9  of  the  NetFlow  format.  We  take  full  advantage  of  the 
template-based  nature  of  the  NetFlow  v9  format,  to  accurately 
provide  the  user  with  complete  control  of  every  field  made 
available  from  information  export  nodes,  such  as  Cisco  routers 
or  network  monitoring  applications  that  support  the  NetFlow 
export  format,  even  in  the  event  NetFlow  templates  change 
during  a  monitoring  period. 

Anontool  is  a  fairly  simple  C  application  that  makes  use  of 
the  AAPI  library  to  support  anonymization  of  packet  traces. 
It  does  not  implement  any  anonymization  functions  in  itself; 
it  is  much  more  transparent  and  less  error  prone  for  all 
the  anonymization  functionality  to  reside  inside  the  AAPI 
implementation.  It  provides  users  the  choice  of  protocols  and 
functions  to  apply  in  order  to  create  their  anonymization 
policy.  Anontool  also  implements  some  functionality  such  as 
preprocessing  a  trace  to  extract  information  which  may  be 
consequently  used  in  the  actual  anonymization  process;  we 
will  explain  in  further  detail  in  Section  IV. 

III.  Attacks  Against  NetFlow  Anonymization 
Policies 

In  this  section,  we  describe  in  detail  two  attacks  against 
conventional  anonymization  policies  and  outline  related  work 
in  both  packet  traces  and  flows. 

We  assume  that  the  anonymization  process  is  applied  onto 
NetFlow  traces,  which  are,  in  the  general  case,  generated  from 
a  router  at  the  border  of  a  monitored  network,  and  export 
information  for  traffic  entering  and  exiting  this  particular 
network.  The  network  could  be  of  any  size  or  topology;  from  a 
small  home  network  to  larger  networks  belonging  to  research 
institutes,  universities,  and  so  on. 

In  our  threat  model,  we  assume  an  active  adversary  that 
is  able  to  direct  traffic  to  the  monitored  network  at  will, 
has  knowledge  of  the  address  space  it  occupies  and  can 
potentially  compromise  hosts  inside  it.  We  assume  a  rational 
attacker,  for  whom  it  is  less  costly,  or  more  useful  even,  to 
“probe”  and  profile  the  monitored  network  before  mounting 
attacks  against  it.  The  adversary  may  also  have  several  external 
hosts  under  her  command.  She  is  also  able  to  gain  access 
to  the  anonymized  traces,  which  will  most  likely  be  publicly 
released. 

The first attack, which we call “Active Fingerprinting”,
aims to break the mapping algorithm when used on IP addresses.
Mapping takes the set of IP addresses in a trace and
performs a simple mapping function onto another, totally different
set. The second set may be the output of a deterministic
function seeded by a random quantity, such as the drand48()
family of functions, or a very simple sequential assignment
of unique IP address numbers which results in a one-to-one
mapping.
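
A sequential one-to-one mapping of the kind described above can be sketched in a few lines of Python. The class name `SequentialMapper` and the starting pseudonym 10.0.0.1 are assumptions for illustration; this is not Anontool's code.

```python
import ipaddress


class SequentialMapper:
    """Assign pseudonym addresses in order of first appearance (one-to-one)."""

    def __init__(self, first="10.0.0.1"):
        self._next = int(ipaddress.IPv4Address(first))
        self._table = {}

    def map(self, ip):
        # The same real address always receives the same pseudonym, which is
        # exactly the property the attack exploits.
        if ip not in self._table:
            self._table[ip] = str(ipaddress.IPv4Address(self._next))
            self._next += 1
        return self._table[ip]
```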

The  second  attack  aims  at  using  the  information  about 
flow  sizes  contained  inside  NetFlow  traces  in  order  to  deduce 
information  about  either  hosts  inside  the  monitored  network  or 
hosts  that  may  be  outside  it,  such  as  their  IP  address,  network 
usage profiles, etc. We name this attack “Statistical Signature
Inference”.

A.  Active  Fingerprinting 

The idea that active fingerprinting exploits is that the mapping
between real and anonymized IP addresses is one-to-one.
Consequently,  if  the  mapping  on  one  flow  is  discovered,  the 
mapping  on  the  whole  trace  is  compromised.  This  attack  has 
been  described  on  packet  traces  in  [7],  [8]  and  we  demonstrate 
its  applicability  on  NetFlow  records  below. 

Using  this  idea,  an  adversary  can  establish  flows  from  a  host 
under  her  control,  which  resides  outside  the  victim  network, 
to  one  or  more  victim  hosts  inside  it.  These  flows  will  appear 
in  the  anonymized  trace.  The  challenge  for  the  adversary  is  to 
construct  those  flows  in  such  a  way  that  they  will  be  easily 
distinguished  in  the  final  trace.  This  can  be  accomplished  in 
a  variety  of  ways;  she  can  craft  a  flow  with  specific  attributes 
which  are  known  not  to  be  anonymized  (the  list  is  as  large 
as  the  potential  fields  listed  in  a  NetFlow  record,  and  may 
usually  include  TCP  flags,  IP  ToS,  and  so  forth),  or  in  the 
unlikely  scenario  where  flows  are  fully  anonymized,  she  can 
use  temporal  patterns  which  are  easy  to  detect.  Even  a  specific 
packet  size  can  be  used  as  an  identifier  for  the  packets  involved 
in  a  dictionary  attack.  If  the  traces  from  the  monitored  network 
span  a  wide  enough  time  period,  the  latter  attack  is  very 
feasible  as  the  trace  contains  a  large  number  of  attack  packets. 
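
To make the risk concrete for policy authors, the sketch below shows how injected flows with a distinctive, non-anonymized marker can be matched in a released trace. It is a simplified Python illustration under assumed conditions: the record layout, the choice of (ToS, byte count) as the marker, and the function name `recover_mapping` are all hypothetical.

```python
def recover_mapping(anonymized_records, probes):
    """Match injected flows by their marker and read off the pseudonyms.

    anonymized_records: iterable of dicts with keys "dst_ip", "tos", "bytes",
        where only "dst_ip" has been anonymized.
    probes: {(tos, byte_count): real_dst_ip}, one unique marker per target.
    Returns {real_dst_ip: anonymized_dst_ip}.
    """
    recovered = {}
    for rec in anonymized_records:
        marker = (rec["tos"], rec["bytes"])
        if marker in probes:
            recovered[probes[marker]] = rec["dst_ip"]
    return recovered
```

Because the address mapping is one-to-one, every pseudonym recovered this way identifies the same real host everywhere else in the trace.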

A trivial way an adversary could create these flows is to
perform a SYN scan on the victim network’s address space.
In this case, even though there is a clear temporal pattern which
is easily detectable in the anonymized trace, the attack can be
defeated easily: if the anonymization process sets only the SYN bit
in the TCP flags and sets the number of bytes to a specific quantity,
the adversary is unable to distinguish live hosts from unused address
space.

We  discuss  a  more  general  measure  to  defend  against  this 
kind  of  attack  in  Section  IV. 

B.  Statistical  Signature  Inference

The idea behind this kind of attack is that each web page
has a structure unique and complex enough to make it
identifiable, despite our best efforts to anonymize its
presence in NetFlow logs while preserving useful information in
them as well.

A  naive  first  effort  would  be  the  following.  Consider  the 
web  sites  interesting.com  and  newssite.com,  and  that  a  web 
session  with  each  of  them  is  n  and  m  bytes  long,  respectively. 
The  adversary  can  use  one  of  the  hosts  under  her  command 
to  initiate  a  web  session  to  these  sites  and  view  the  NetFlow 
records for source and destination IP addresses, port numbers,
and the total size of traffic exchanged. Assuming web page
sizes  do  not  radically  differ  from  one  session  to  another,  and 
that  NetFlow  data  records  TCP  traffic  in  its  entirety,  it  is 
possible  to  filter  out  the  set  of  web  browsing  sessions  from 
an  anonymized  trace  and  construct  a  frequency  histogram  with 
the  number  of  bytes  transferred  in  each  flow.  According  to  our 
assumptions  above,  it  is  possible  to  see  a  great  deal  of  flows 
around  the  values  of  n  and  m.  The  adversary  can  employ 
the  same  tactic  to  find  those  flows,  and  then  gather  further 
information  about  hosts  inside  the  monitored  network,  which 
can then be used to answer questions such as “What web
sites does host A visit?” or “Which hosts frequently visit
www.google.com?”, and to build user profiles.
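
The naive matching step described above amounts to selecting flows whose size lies close to a known session size and counting who produced them. The Python sketch below illustrates this; the record layout, the tolerance parameter, and the function names are assumptions.

```python
from collections import Counter


def flows_near(flows, target_size, tolerance):
    """Flows whose byte count is within tolerance of a known session size."""
    return [f for f in flows if abs(f["bytes"] - target_size) <= tolerance]


def visitors(flows, target_size, tolerance):
    """Count, per (anonymized) source address, the flows matching the size."""
    return Counter(f["src_ip"] for f in flows_near(flows, target_size, tolerance))
```

With n as the measured size of a session with one site, `visitors(flows, n, tolerance)` approximates the list of hosts that visited it, provided session sizes are stable as assumed in the text.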

Past  work  [8]  has  demonstrated  this  kind  of  attack  on  packet 
traces.  Recent  work  [9]  has  extended  and  demonstrated  this 
attack  on  NetFlow  logs  as  well.  We  argue  that  the  fundamental 
property  of  web  sessions  that  allows  this  kind  of  analysis  to 
be  exploited  is  the  fact  that  web  sessions  to  different  hosts 
produce  flows  with  similar  characteristics,  especially  in  flow 
size. In Section IV we present a primitive that addresses
this issue.

IV.  Countermeasures 

The  previous  section  described  two  attacks  for  revealing 
sensitive  information  from  an  anonymized  NetFlow  trace.  In 
this  section,  we  will  describe  our  proposed  ways  to  deal  with 
the  aforementioned  attacks,  and  evaluate  their  consistency. 

A.  Bidirectional  Mapping 

We propose a way to deal with the issue that does not
iteratively consider all the combinations of fields an adversary
may use to craft her flows. Instead, we aim to eliminate
the one-to-one mapping property without losing all of the
information the trace can provide. To this end, we propose
the use of a bidirectional mapping, that is, a different mapping
for each traffic direction. Let A be the IP address of a live
host inside the monitored network, and B the IP address of
a host outside the monitored network. Conventional mapping
functions would map a flow (A, B) to (A', B') and a flow (B,
A) to (B', A'). Bidirectional mapping maps a different address
to A according to the direction of the flow that involves it. In
the case of our example, the flow (A, B) would be mapped to
(A', B'), yet the flow (B, A) would receive a different mapping,
say (C, D).
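
The (A, B) and (B, A) example can be reproduced with a small Python sketch that keeps one mapping table per traffic direction. This is an illustration only, not the Anontool primitive: the class name, the way direction is decided (source inside a configured network), and the pseudonym range starting at 10.0.0.1 are assumptions.

```python
import ipaddress
import itertools


class BidirectionalMapper:
    """Keep a separate address mapping for outgoing and incoming flows."""

    def __init__(self, internal_network):
        self._internal = ipaddress.ip_network(internal_network)
        self._tables = {"out": {}, "in": {}}
        self._counter = itertools.count(1)   # shared, so pseudonyms never collide
        self._base = int(ipaddress.IPv4Address("10.0.0.0"))

    def _pseudonym(self, direction, ip):
        table = self._tables[direction]
        if ip not in table:
            table[ip] = str(ipaddress.IPv4Address(self._base + next(self._counter)))
        return table[ip]

    def map_flow(self, src, dst):
        """Map (src, dst) using the table for this flow's direction."""
        outgoing = ipaddress.ip_address(src) in self._internal
        direction = "out" if outgoing else "in"
        return self._pseudonym(direction, src), self._pseudonym(direction, dst)
```

Within one direction the mapping stays consistent, so per-direction statistics survive, while the two directions of the same conversation can no longer be linked through their addresses.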

Using  this  anonymization  scheme  we  prevent  the  attacker 
from  identifying  her  own  network  flows  inside  the  anonymized 
trace.  Thus,  it  is  made  impossible  to  correlate  her  data  with 
the  trace  information  and  reveal  any  sensitive  data  from  it. 
Also, most of the statistical information derived from the trace
remains the same. We can still gather information about the
incoming and outgoing traffic of the organization and identify
the producers and the consumers of the network. Correlation
of incoming and outgoing traffic for a specific IP address
cannot be done, but we argue that this is a general trade-off of
the anonymization process, and it is up to the organization to
decide whether to sacrifice sensitivity for usability in the data
it makes public.

The  implementation  of  such  a  primitive  is  quite  easy,  and 
it  is  already  included  in  the  stable  version  of  Anontool. 

B.  Random  Value  Shifting

In  order  to  diminish  the  viability  of  a  statistical  identification 
approach, but still be able to calculate some basic representative
statistics about a NetFlow log, we propose a randomized
shifting  of  values,  which  we  will  describe  below. 

Given  a  NetFlow  data  field  with  a  given  value  range,  our 
intent  is  to  “scramble”  its  values  across  the  NetFlow  log  to 
the  point  that  we  make  an  adversary  unable  to  distinguish 
between  two  web  sessions  with  the  same  web  site  and  two 
web  sessions  with  web  sites  that  have  similar  web  pages  in 
size,  but  not  as  much  as  to  destroy  all  the  useful  information 
a  log  may  provide.  More  specifically,  we  intend  to  preserve 
metrics  such  as  arithmetic  averages  and  standard  deviation, 
as  well  as  other  descriptive  statistics.  On  the  other  hand,  we 
wish  to  obfuscate  inferential  statistics,  so  that  an  adversary 
would  be  unable  to  reach  conclusions  that  extend  beyond  the 
immediate  data  alone. 

For  clarity,  we  are  going  to  use  the  flow  size  NetFlow  field 
as  an  example  from  here  on. 

Our  method  is  to  add  to  the  value  of  the  flow  size  field 
a  random  value.  This  value  is  chosen  uniformly  at  random 
from  a  fixed  range  [-d,  d].  One  of  the  basic  properties  of 
our  choice  is  it  allows  us  to  directly  preserve  the  arithmetic 
average  and  standard  deviation  of  the  original  distribution  of 
flow  size  values.  The  parameter  d  may  be  chosen  arbitrarily, 
but  we  will  demonstrate  the  importance  of  an  educated  choice 
with an example. Consider, as an elementary example, three
flows with sizes of 15, 17 and 25 bytes, which repeatedly
occur within a NetFlow log. Choosing d to be equal to two,
this is what happens in the anonymized trace: the flows with
the initial size of 15 now occur with flow sizes from thirteen
to seventeen. Flows with the size of 17 bytes now occur with
sizes from fifteen to nineteen. These two groups of flows are
now “mixed up”; what happens is that the confidence intervals
for the random variables which represent the flow sizes of
each flow are now different, and they overlap. On the other
hand, that is not the case for the flow with size 25 bytes. It
now occurs in the anonymized trace with values from 23 to
27.  An  adversary  is  still  able  to  distinguish  this  flow  from 
the  other  two  with  relative  ease.  Now  we  can  easily  conclude 
that  a  proper  choice  for  the  parameter  d  will  have  to  take 
the  entirety  of  the  NetFlow  log  into  consideration.  This  is  an 
interesting  topic  for  future  work,  which  we  will  not  further 
explore  in  the  rest  of  the  paper  due  to  lack  of  space. 
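
The worked example above can be checked with a short Python sketch. It assumes integer shifts drawn uniformly from [-d, d]; the function names are illustrative and this is not the Anontool implementation. Independent zero-mean noise leaves the expected mean unchanged and, for integer shifts, adds d(d+1)/3 to the variance, which is small when d is small relative to the spread of the data.

```python
import random


def shift_values(values, d, rng=random):
    """Add an independent integer shift, uniform on [-d, d], to every value."""
    return [v + rng.randint(-d, d) for v in values]


def shifted_range(value, d):
    """Smallest and largest value a field can take after shifting."""
    return (value - d, value + d)


def ranges_overlap(a, b, d):
    """True if the shifted ranges of two original values intersect."""
    return abs(a - b) <= 2 * d


# The example from the text: sizes 15, 17 and 25 bytes with d = 2.
EXAMPLE_SIZES = (15, 17, 25)
EXAMPLE_D = 2
```

With these values, `ranges_overlap(15, 17, 2)` holds while `ranges_overlap(17, 25, 2)` does not, which is why the 25-byte flow remains distinguishable.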

To  verify  our  assumptions  about  the  descriptive  statistics  of 
a  NetFlow  log  being  preserved  after  the  application  of  random 
value  shifting,  we  implemented  it  in  Anontool  and  proceeded 
to  process  a  NetFlow  packet  trace  with  it.  Our  choice  for  the 
parameter  d  was  the  minimum  flow  size  observed,  divided 
by  2.  As  we  previously  mentioned,  this  is  most  likely  not 
a good choice for real world applications, but it is good
enough for the experimental evaluation we describe. We then
calculated the arithmetic average and the standard deviation
for both the original and the anonymized trace, which we
present here. The NetFlow trace spanned a time period of three
minutes and a bit more than 150,000 bytes transferred. Table I
presents the values calculated for both traces. We can see
that the average and standard deviation do not largely differ.
This supports our initial hypothesis, that we can preserve
some amount of general information about NetFlows in the
trace even after performing random value shifting. Figure 1
presents the cumulative distribution function of the distribution
of flow sizes in the original and the anonymized traces, for
comparison and reference. As we can see, the distribution
remains almost identical after the anonymization process.
This reinforces our initial argument that the information that
can be derived from the anonymized trace is not affected
by the value scrambling.

Fig. 1. Cumulative distribution function of flow sizes (anonymized and non-anonymized data).

TABLE I
Some basic descriptive statistics regarding a NetFlow trace before and after anonymization.

Trace        Average Flow Size (bytes)   Std. Deviation
Original     1843.69                     7336
Anonymized   1845.02                     7335.52

V.  Conclusions and Future Work

We  described  two  scenarios  where  an  attacker  can,  by 
manipulating the anonymization process, deduce useful information
about non-anonymized data in NetFlow traces. Those
attacks  have  already  been  carried  out  in  packet  traces,  and 
we  have  shown  their  applicability  on  NetFlow  logs  as  well. 
In  order  to  protect  against  these  types  of  attacks,  we  have 
introduced  two  anonymization  primitives  and  discussed  their 
use  and  parameterization  in  order  to  make  educated  choices 
about  anonymization  policies.  We  also  provided  data  which 
suggest  their  use  still  preserves  useful  data  about  NetFlow 
logs, without exposing inferential statistics to potential adversaries.

VI.  Availability 

Anontool  can  be  downloaded  from  http://dcs.ics.forth.gr/ 
Activities/Projects/anontool.html.  The  application  has  been 
installed  and  tested  on  RedHat  and  Debian  distributions  of 
the  Linux  operating  system. 
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Currently, Anontool performs some basic trace preprocessing
when random value shifting is going to be used.
It processes all packets in a trace in order to extract the
information it needs to calculate d. This information is dependent
on the field that random value shifting is being applied
to. After the value of d is calculated and chosen according
to the user’s method of choice, the actual anonymization
process takes place. It is possible to estimate d during the
anonymization process; however, as knowledge of the whole
trace  is  impossible  to  have  until  the  whole  trace  has  been 
processed,  we  believe  the  estimated  value  will  not  yield  as 
good  a  result  as  when  a  trace  can  be  preprocessed  and  the 
value  of  d  calculated  on  it. 
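
The two-pass procedure described above might look like the following Python sketch. It assumes the rule used in the experimental evaluation (d equal to the minimum observed flow size divided by two, rounded down here) and integer shifts; the function names are hypothetical and the code is not taken from Anontool.

```python
import random


def choose_d(flow_sizes):
    """First pass: derive d from the whole trace (minimum flow size halved)."""
    return min(flow_sizes) // 2


def anonymize_sizes(flow_sizes, rng=random):
    """Second pass: shift every flow size by a uniform integer in [-d, d].

    Returns (d, shifted_sizes). The input is materialised as a list because
    the trace has to be read twice.
    """
    sizes = list(flow_sizes)
    d = choose_d(sizes)
    return d, [size + rng.randint(-d, d) for size in sizes]
```

One consequence of this particular rule is that d is always smaller than the smallest flow, so no shifted flow size can become zero or negative.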


